### Packages to install:
- TextBlob
- Pandas

### Description:
This first piece of code is a function that reads in the raw text data and generates a dataframe for each genre. The structure of the dataframes is described later in the notebook. The resulting dictionary of dataframes is saved to 'data_project.p' using pickle.

In [1]:
def generate_DF(year_begin=2015, year_end=2019):
    '''
    Description:
    Generating the dataframes for the genres and various features based on the raw text files
    
    Input:
    year_begin = int; numerical year from which the analysis is to start
    year_end = int; numerical year at the end of which the analysis is to terminate
    '''
    import os
    import pandas as pd
    from textblob import TextBlob
    from datetime import datetime
    from statistics import mean, median
    from textblob import Word
    import pickle
    genre_list = os.listdir('data')
    if '.DS_Store' in genre_list:
        genre_list.remove('.DS_Store')
    # Month indices 1..N; with the defaults this covers Jan 2015 - Dec 2019 (both inclusive)
    month_categories = list(range(1, (year_end-year_begin+1)*12 + 1))
    data = dict()
    for genre in genre_list:
        name = genre[0: genre.find('_dataFrame.txt')]
        df_input = pd.read_csv(os.path.join('data', genre))
        df = pd.DataFrame(columns=["words", "frequency_time", "likes", "likes_mean", "likes_median", "dislikes",
                                   "dislikes_mean", "dislikes_median", "views", "views_mean", "views_median",
                                   "polarity", "subjectivity"])
        for n in range(0, len(df_input)):
            try:
                date = datetime.strptime(df_input.date[n], "%b %d, %Y")
            except (ValueError, TypeError):
                # Skip rows whose date is missing or not in the "Jan 05, 2016" format
                continue
            date_score = (date.year - year_begin)*12 + date.month
            total_text = str(df_input.title[n]) + " " + str(df_input.description[n])
            Blob = TextBlob(total_text)
            Blob = Blob.lower()
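            # The date check does not depend on the word, so do it once per row
            if date_score > month_categories[-1] or date_score < month_categories[0]:
                continue
            for word in Blob.words:
                # Skip tokens that have no dictionary (WordNet) definition
                if len(Word(word).define())==0:
                    continue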
                if word in list(df.words):
                    word_index = df.words[df.words==word].index[0]
                    df.frequency_time[word_index][date_score-1]+=1
                    df.likes[word_index].append(df_input.likes[n])
                    df.dislikes[word_index].append(df_input.dislikes[n])
                    df.views[word_index].append(df_input.views[n])
                else:
                    freq_t = [0]*len(month_categories)
                    freq_t[month_categories.index(date_score)] += 1
                    likes = [df_input.likes[n]]
                    dislikes = [df_input.dislikes[n]]
                    views = [df_input.views[n]]
                    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                    new_row = pd.DataFrame([[word, freq_t, likes, 0, 0, dislikes, 0, 0, views,
                                             0, 0, TextBlob(word).polarity, TextBlob(word).subjectivity]],
                                           columns=list(df.columns))
                    df = pd.concat([df, new_row], ignore_index=True)
        for n in range(0, len(df)):
            df.loc[n, 'likes_mean'] = mean(df.likes[n])
            df.loc[n, 'likes_median'] = median(df.likes[n])
            df.loc[n, 'dislikes_mean'] = mean(df.dislikes[n])
            df.loc[n, 'dislikes_median'] = median(df.dislikes[n])
            df.loc[n, 'views_mean'] = mean(df.views[n])
            df.loc[n, 'views_median'] = median(df.views[n])
        df.drop(columns=['likes', 'dislikes', 'views'], inplace=True)
        data[name] = df
    # Write all genres once, after the loop, and close the file when done
    with open('data_project.p', 'wb') as f:
        pickle.dump(data, f)

Here's how to use this function. Note that it returns nothing: it writes the dictionary of dataframes to 'data_project.p' in the working directory.

In [2]:
generate_DF()
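
Judging from the column accesses inside 'generate_DF', each file in 'data' is a CSV named '<genre>_dataFrame.txt' with at least the columns date, title, description, likes, dislikes and views, and with dates written like "Jan 05, 2016". Here's a minimal sketch of that layout and of the per-row date handling, using only the standard library instead of pandas. The sample rows are invented for illustration and are not taken from the real data.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample mimicking data/cooking_dataFrame.txt; the values are made up.
sample = io.StringIO(
    'date,title,description,likes,dislikes,views\n'
    '"Jan 05, 2016",Apple pie,A happy recipe,120,3,4500\n'
    '"Streamed live",Bad row,Unparseable date,1,0,10\n'
)

year_begin = 2015
parsed = []
for row in csv.DictReader(sample):
    try:
        date = datetime.strptime(row["date"], "%b %d, %Y")
    except ValueError:
        # Rows whose date does not match the expected format are skipped.
        continue
    # Same month index as in generate_DF: 1 for January of year_begin.
    date_score = (date.year - year_begin) * 12 + date.month
    parsed.append((row["title"], date_score))

print(parsed)  # [('Apple pie', 13)]
```

The second row is dropped because its date cannot be parsed, which mirrors the try/except in the function.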

### Description:
This second piece of code loads the saved data back using pickle.

In [3]:
import pickle

In [4]:
with open('data_project.p', 'rb') as f:
    data = pickle.load(f)

So 'data' is a dictionary where the keys are the genres and the values are the corresponding dataframes.

In [5]:
data.keys()

dict_keys(['cooking', 'influencers', 'gaming'])

Here's an example of the dataframe under 'cooking'.

In [6]:
data['cooking']

| | words | frequency_time | likes_mean | likes_median | dislikes_mean | dislikes_median | views_mean | views_median | polarity | subjectivity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | binging | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 98063 | 90976 | 1307 | 1050 | 5974777 | 5138365 | 0.0 | 0.0 |
| 1 | pie | [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, ... | 27906 | 5827 | 497 | 99 | 1731450 | 816877 | 0.0 | 0.0 |
| 2 | love | [1, 3, 2, 7, 5, 4, 2, 2, 3, 2, 1, 6, 1, 0, 3, ... | 35093 | 24134 | 393 | 161 | 1786182 | 821229 | 0.5 | 0.6 |
| 3 | actually | [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ... | 66876 | 37769.5 | 1146 | 676 | 2512649 | 2.57215e+06 | 0.0 | 0.1 |
| 4 | happy | [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, ... | 23188 | 4967.5 | 285 | 57.5 | 1699886 | 552900 | 0.8 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6808 | darkness | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 22573 | 22573 | 209 | 209 | 1825511 | 1825511 | 0.0 | 0.0 |
| 6809 | haunts | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 22573 | 22573 | 209 | 209 | 1825511 | 1825511 | 0.0 | 0.0 |
| 6810 | synchronize | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 30682 | 30682 | 314 | 314 | 2894904 | 2894904 | 0.0 | 0.0 |
| 6811 | worst | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 58005 | 58005 | 1694 | 1694 | 1688890 | 1688890 | -1.0 | 1.0 |


Here, the rows represent the words, and the columns represent their respective data points.

Let's look at an example row.

In [7]:
data['cooking'].loc[0, :]

words                                                        binging
frequency_time     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
likes_mean                                                     98063
likes_median                                                   90976
dislikes_mean                                                   1307
dislikes_median                                                 1050
views_mean                                                   5974777
views_median                                                 5138365
polarity                                                           0
subjectivity                                                       0
Name: 0, dtype: object

Most of the columns are self-explanatory. 'likes_mean' and 'likes_median' are the mean and median number of likes over all occurrences of the word in the titles and descriptions of that genre; the 'dislikes_*' and 'views_*' columns are defined the same way. 'polarity' and 'subjectivity' are TextBlob's sentiment scores for the word on its own.
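
As a small illustration of where these summary columns come from: 'generate_DF' collects one likes value per occurrence of a word and then reduces that list with statistics.mean and statistics.median. The numbers below are invented, and the variable names are only for this sketch.

```python
from statistics import mean, median

# Hypothetical likes counts gathered for one word, one entry per occurrence.
likes_for_word = [100, 250, 4000]

likes_mean = mean(likes_for_word)      # pulled upward by the single large value
likes_median = median(likes_for_word)  # the middle value, less sensitive to outliers

print(likes_mean, likes_median)  # 1450 250
```

The gap between the two values is why both are kept: a word that appears in one very popular entry gets a high mean but an unremarkable median.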

The only complex column is 'frequency_time'. It is a list with one element per calendar month, running from January of 'year_begin' to December of 'year_end'; each element counts how many times the word occurred in that month. The 'year_begin' and 'year_end' arguments of 'generate_DF' therefore decide the number of elements. For example, if year_begin=2004 and year_end=2006, 'frequency_time' has (2006-2004+1)*12=36 elements.
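
To make the indexing concrete, here's a short sketch of how a calendar month maps to a position in 'frequency_time' and back. It follows the date_score arithmetic in 'generate_DF' (which stores counts at position date_score-1); the helper names month_index and index_to_year_month are made up for this example and are not part of the notebook.

```python
def month_index(year, month, year_begin):
    """0-based position in frequency_time for a given calendar month."""
    return (year - year_begin) * 12 + month - 1


def index_to_year_month(index, year_begin):
    """Inverse mapping: 0-based position back to (year, month)."""
    return year_begin + index // 12, index % 12 + 1


year_begin, year_end = 2004, 2006
n_months = (year_end - year_begin + 1) * 12

print(n_months)                             # 36
print(month_index(2005, 3, year_begin))     # 14
print(index_to_year_month(35, year_begin))  # (2006, 12)
```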

### Description:
This third piece of code is a function that analyzes a piece of text for a given genre and metric. It returns a word-by-word categorization of the text ("red", "yellow", "green", or "white"; the docstring explains each label) and an overall score for the text on that metric.

In [8]:
def analyze_text(text, genre, metric):
    '''
    Purpose: 
    Analyze the given text and produce color labels for the words as well as generate an overall score based on the
    given genre and metric
    
    Input:
    text = str; scalar depicting the text that needs to be analyzed
    genre = str; scalar depicting the genre of the content: "cooking", "gaming", "influencers"
    metric = str; scalar depicting the metric to base the analysis on: "likes_mean", "likes_median",
             "dislikes_mean", "dislikes_median", "views_mean", "views_median", "polarity", "subjectivity"
    
    Output:
    analysis = list; a list with the same number of elements as number of words in given text, with each
               corresponding element being the color for that word: "red" means bad, "yellow" means okay, "green"
               means good and "white" means "Not found" (in database)
    score_avg = float or str; mean score of the words found in the database, or "Not Applicable" if none of
                the words matched
               
    '''
    from textblob import TextBlob
    import pickle
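    # Keep the loaded dictionary; the original call discarded it and relied on a global 'data'
    with open('data_project.p', 'rb') as f:
        data = pickle.load(f)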
    Blob = TextBlob(text)
    scores = []
    df_genre = data[genre]
    score_avg = 0
    counter = 0
    for word in list(Blob.words):
        if word in list(df_genre.words):
            word_index = df_genre.words[df_genre.words==word].index[0]
            scores.append(df_genre[metric][word_index])
            score_avg += df_genre[metric][word_index]
            counter += 1
        else:
            scores.append("Not found")
    if counter > 0:
        score_avg = score_avg/counter
    else:
        score_avg = "Not Applicable"
    intervals = [df_genre[metric].mean()-df_genre[metric].std(), df_genre[metric].mean()+df_genre[metric].std()]
    analysis = []
    for score in scores:
        if score=="Not found":
            analysis.append("white")
            continue
        if score<=intervals[0]:
            analysis.append("red")
        elif score>intervals[0] and score<=intervals[1]:
            analysis.append("yellow")
        elif score>intervals[1]:
            analysis.append("green")
    return analysis, score_avg

Here's an example of how to use this.

In [9]:
categorization, overall_score = analyze_text("Hi, today we will cook pork", "cooking", "likes_mean")

In [10]:
categorization

['white', 'yellow', 'white', 'yellow', 'yellow', 'yellow']

In [11]:
overall_score

29914.25
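
To see how the colors and the overall score come about, here's a minimal standard-library sketch of the same rule 'analyze_text' applies: a score at or below mean minus one standard deviation is "red", a score above mean plus one standard deviation is "green", and anything in between is "yellow". The overall score averages only the words that were found. The function name color_for, the toy metric column and the toy word scores are all invented for this illustration.

```python
from statistics import mean, stdev


def color_for(score, all_scores):
    """Label one score against mean +/- one sample standard deviation."""
    center = mean(all_scores)
    spread = stdev(all_scores)  # sample std (n-1), like the pandas default for .std()
    low, high = center - spread, center + spread
    if score <= low:
        return "red"
    if score <= high:
        return "yellow"
    return "green"


# Hypothetical metric column for one genre; values are made up.
metric_column = [1, 50, 50, 50, 50, 99]
labels = [color_for(s, metric_column) for s in metric_column]
print(labels)  # ['red', 'yellow', 'yellow', 'yellow', 'yellow', 'green']

# Overall score: average over matched words only, as in analyze_text.
word_scores = {"cook": 50, "pork": 99}  # hypothetical lookup table
text_words = ["today", "cook", "pork"]
matched = [word_scores[w] for w in text_words if w in word_scores]
overall = sum(matched) / len(matched) if matched else "Not Applicable"
print(overall)  # 74.5
```

Note that the rule treats a higher value as better for every metric, so with "dislikes_mean" a "green" word is one associated with more dislikes.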