# Data cleaning and preprocessing

Because we gathered the data ourselves, it is already fairly clean. However, it still needs a good deal of preprocessing. First we recreate the data frame from [ProblemStatement](ProblemStatement.ipynb), this time adding titles to the columns.

In [1]:
# Import pandas to make the data frame
import pandas as pd

# Recreate the data frame using the output file from ProblemStatement
raw_data = pd.read_csv("ProblemStatementOutputs/50_25-03-2023_17-09-02.csv")
# Add headers to the columns
raw_data.columns = ["Post type", "Title", "Post url", "Author(s)", "Date and time posted", "Tags", "No. of comments"]
# Show the data frame
raw_data

Unnamed: 0,Post type,Title,Post url,Author(s),Date and time posted,Tags,No. of comments
0,article,"Unions Have Been Under Attack For Decades, But...",https://fivethirtyeight.com/features/unions-ha...,Monica Potts,"Mar. 24, 2023, at 3:37 PM",Partisanship; State Legislatures; Labor; Unions,7
1,article,Which Taylor Swift Album Is The Most Popular?,https://fivethirtyeight.com/features/taylor-sw...,Nathaniel Rakich,"Mar. 24, 2023, at 6:00 AM",Polling; Polls; Pollapalooza; Pollsters; Music...,4
2,podcast,Will Voters Care If Trump Gets Indicted?,https://fivethirtyeight.com/videos/will-voters...,Galen Druke; Amelia Thomson-DeVeaux; Nathaniel...,"Mar. 23, 2023",Donald Trump; Politics Podcast; Video; 2024 Pr...,13
3,podcast,Politics Podcast: The Manhattan DA Might Be Th...,https://fivethirtyeight.com/features/politics-...,Galen Druke; Amelia Thomson-DeVeaux; Nathaniel...,"Mar. 23, 2023",Donald Trump; FiveThirtyEight Podcasts; Politi...,3
4,video,Why Is Biden Moving To The Political Center?,https://fivethirtyeight.com/videos/why-is-bide...,Nathaniel Rakich; Tony Chow,"Mar. 23, 2023",Video; Joe Biden; Polling; 2022 Election; Feat...,15
5,article,Recess Is Good For Kids. Why Don’t More States...,https://fivethirtyeight.com/features/recess-is...,Monica Potts,"Mar. 23, 2023, at 6:00 AM",Education; State Legislatures; Schools; Child ...,6
6,article,What We Know About Trump’s Legal Troubles,https://fivethirtyeight.com/features/what-we-k...,Amelia Thomson-DeVeaux,"Mar. 22, 2023, at 12:57 PM",Donald Trump; 2020 Election; Jan. 6 Investigat...,11
7,article,What Happens If Trump Is Indicted?,https://fivethirtyeight.com/features/trump-ind...,Nathaniel Rakich; Amelia Thomson-DeVeaux; Geof...,"Mar. 21, 2023, at 1:16 PM",Donald Trump; Slack Chat; Scandals; 2024 Elect...,13
8,article,States Are Cracking Down On Militias — Except ...,https://fivethirtyeight.com/features/state-mil...,Kaleigh Rogers,"Mar. 21, 2023, at 6:00 AM",Oregon; New Mexico; State Legislatures; Vermon...,10
9,podcast,Bank Bailout Politics 2.0,https://fivethirtyeight.com/videos/bank-bailou...,Galen Druke; Nate Silver; Santul Nerkar; Monic...,"Mar. 20, 2023",Politics Podcast; Video; Bailout; Financial Re...,6


Now, for each distinct value in a column (each author, tag, or post type), we count the posts it appears in, total the comments on those posts, and compute the average number of comments per post.

In [2]:
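# Input: one of the column titles from the raw_data data frame, as a string.
#        A cell in that column may hold several values separated by "; ".
# Relies on the global raw_data data frame.
# Output: a data frame with one row per distinct value, containing its frequency
#         (number of posts), total number of comments, and average number of comments per post.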
def comments_analysis(attribute):
    #print("Data for "+attribute+":") # for debugging
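    # Collect every value in the column; a cell may hold several values separated by "; "
    all_attr_with_repeats = []
    for feature_attr in raw_data[attribute]:
        split_values = str(feature_attr).split("; ")
        for attr in split_values:
            all_attr_with_repeats.append(attr)
    # Drop the repeats to get the distinct values
    all_attr_list = [*set(all_attr_with_repeats)]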
    #print("Possible values:\n", all_attr_list) # for debugging
    attr_frequency_list = []
    attr_total_comments_list = []
    attr_avg_no_of_comments_list = []
    for i in range(len(all_attr_list)):
        #print("Iteration "+str(i)+".  Attribute is "+all_attr_list[i]+".") # for debugging
        #print("Here is the nested loop.") # for debugging
        occurrences = 0
        comments = 0
        for index, row in raw_data.iterrows():
            #print("The index is "+str(index)+".") # for debugging
            attr_in_row_list = str(row[attribute]).split("; ")
            #print("The attrs are:\n"+str(attr_in_row_list)+".") # for debugging
            #print("The Boolean value of the if statement is "+ str(all_attr_list[i] in attr_in_row_list)+".") # for debugging
            if all_attr_list[i] in attr_in_row_list:
                #print("Before occurrences = "+str(occurrences)+ " and comments = "+ str(comments)+".") # for debugging
                occurrences = occurrences + 1
                comments = comments + int(row["No. of comments"])
                #print("Now occurrences = "+str(occurrences)+ " and comments = "+ str(comments)+".") # for debugging
        #print("Exiting the nested loop.") # for debugging
        #print("Results:  occurrences = "+str(occurrences)+" and comments = "+str(comments)+".") # for debugging
        attr_frequency_list.append(occurrences)
        attr_total_comments_list.append(comments)
        # Average number of comments per post for this value
        avg = comments / occurrences
        #print("Finally, avg = "+str(avg)+".\n") # for debugging
        attr_avg_no_of_comments_list.append(avg)
    attr_info_dict = {
        attribute:all_attr_list, 
        "Frequency":attr_frequency_list, 
        "Total no. of comments":attr_total_comments_list,
        "Avg no. of comments":attr_avg_no_of_comments_list
    }    
    return pd.DataFrame(attr_info_dict, columns = [attribute, "Frequency", "Total no. of comments", "Avg no. of comments"])
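
The same statistics can also be computed without explicit loops. The block below is a minimal sketch of one alternative, not the method used in this notebook: it splits each cell on "; ", puts every value on its own row with `explode`, and aggregates with `groupby`. The function name `comments_analysis_vectorized`, the helper columns `post_id` and `comments`, and the small `demo` frame are hypothetical names introduced only for illustration. Unlike `comments_analysis`, it takes the data frame as an argument and returns the rows sorted by value.

```python
import pandas as pd

def comments_analysis_vectorized(df, attribute, sep="; "):
    # One row per post, with the multi-valued cell split into a list
    pairs = pd.DataFrame({
        attribute: df[attribute].astype(str).str.split(sep),
        "comments": df["No. of comments"].astype(int),
    })
    pairs["post_id"] = range(len(pairs))  # hypothetical helper column
    # One row per (post, value) pair; a value repeated within one post counts once
    pairs = pairs.explode(attribute).drop_duplicates(["post_id", attribute])
    # Count posts, total the comments, and average them for each value
    summary = (
        pairs.groupby(attribute)["comments"]
        .agg(["count", "sum", "mean"])
        .reset_index()
    )
    summary.columns = [attribute, "Frequency", "Total no. of comments", "Avg no. of comments"]
    return summary

# Made-up rows for illustration only (not taken from the scraped data)
demo = pd.DataFrame({
    "Tags": ["A; B", "B", "A; A"],
    "No. of comments": [4, 6, 10],
})
demo_summary = comments_analysis_vectorized(demo, "Tags")
print(demo_summary)
```

Assuming the column names are unchanged, calling `comments_analysis_vectorized(raw_data, "Tags")` should give the same numbers as `comments_analysis("Tags")`, only in a different row order.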

In [3]:
# Try the function on the authors column
authors_info = comments_analysis("Author(s)")
authors_info

Unnamed: 0,Author(s),Frequency,Total no. of comments,Avg no. of comments
0,Amelia Thomson-DeVeaux,13,97,7.461538
1,Nathaniel Rakich,18,156,8.666667
2,Alex Samuels,4,48,12.0
3,Geoffrey Skelley,7,61,8.714286
4,Galen Druke,17,94,5.529412
5,Kaleigh Rogers,6,69,11.5
6,Julian Kim,1,15,15.0
7,Monica Potts,8,66,8.25
8,Maggie Koerth,2,21,10.5
9,Nate Silver,11,58,5.272727


In [4]:
# Try the function on the tags column
tags_info = comments_analysis("Tags")
tags_info

Unnamed: 0,Tags,Frequency,Total no. of comments,Avg no. of comments
0,Militias,1,10,10.000000
1,Marianne WIlliamson,3,31,10.333333
2,Facebook,1,4,4.000000
3,Federalism,1,10,10.000000
4,2024 Democratic Primary,2,28,14.000000
...,...,...,...,...
134,Iraq,1,4,4.000000
135,College,1,10,10.000000
136,Campaign Finance,1,6,6.000000
137,Extremism,1,8,8.000000


In [5]:
# Try the function on the post type column
post_type_info = comments_analysis("Post type")
post_type_info

Unnamed: 0,Post type,Frequency,Total no. of comments,Avg no. of comments
0,article,29,285,9.827586
1,video,4,43,10.75
2,podcast,17,94,5.529412
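
As a quick sanity check, the averages in the post type table can be recomputed by hand from the frequency and total columns. The sketch below copies the three rows from the output above into a dictionary; the names `post_type_rows`, `averages`, `total_posts`, and `total_comments` are hypothetical. It assumes each post has exactly one post type, so the frequencies should add up to the number of posts.

```python
# (frequency, total comments) copied from the post_type_info output above
post_type_rows = {
    "article": (29, 285),
    "video": (4, 43),
    "podcast": (17, 94),
}

# Average comments per post = total comments / frequency
averages = {
    post_type: round(total / freq, 6)
    for post_type, (freq, total) in post_type_rows.items()
}
total_posts = sum(freq for freq, _ in post_type_rows.values())
total_comments = sum(total for _, total in post_type_rows.values())

print(averages)        # {'article': 9.827586, 'video': 10.75, 'podcast': 5.529412}
print(total_posts)     # 50
print(total_comments)  # 422
```

The recomputed averages match the table, and the frequencies sum to 50 posts.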
