## The global warming issue and narratives around it
### Part 2: Cleaning the imported data, doing a brief EDA for early assessment, and pickling the merged dataframe

In this notebook, I cleaned the dataframes imported from the API and saved a clean version in the "../datasets" folder for further processing.

Importing the required libraries:

In [1]:
#imports
import pandas as pd
import re  # the standard-library module covers the re.sub calls used below
import warnings
warnings.filterwarnings('ignore')
from nltk.corpus import stopwords # Import the stopword list

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

import pickle

### Part 2.1: Importing the saved raw GlobalWarming data from the Reddit API and cleaning it

In [2]:
#Global warming
file_path = "../datasets/" + "GlobalWarming" + "_raw" + ".csv"
df_gw = pd.read_csv(file_path)

# Keeping only a few columns which will be helpful during analysis
to_keep_clmns = ['author', 'created_utc', 'domain', 'id', 'num_comments', 'over_18',
       'post_hint', 'score', 'selftext',
       'title']

df_gw_clean = df_gw[to_keep_clmns].copy()  # copy, so later edits do not act on a slice of df_gw


df_gw_clean.head(10)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,selftext,title
0,Kafka15,1593554514,i.redd.it,hixbtf,2,False,image,1,,Cum
1,karan_negiiiii,1593497051,boringworld.org,hihj6s,0,False,,1,,Climate Change in india.
2,Hildavardr,1593479932,self.GlobalWarming,hidb5h,0,False,,1,[removed],Global warming and the responsibility of big I...
3,pEppapiGistfuhrer,1593455137,i.redd.it,hi5h41,1,False,image,1,,Ayy lets stop global warming
4,BrexitBlaze,1593455005,theguardian.com,hi5feq,2,False,link,2,,UK ministers send mixed messages over climate ...
5,ManesJr,1593451635,i.redd.it,hi493l,1,False,image,1,,Mmm yes
6,-i-love-downvotes-,1593433887,i.redd.it,hhyzxy,1,False,image,1,,Oh yeeaahh
7,BrexitBlaze,1593425020,bbc.com,hhx644,1,False,link,1,,"Extra £14bn needed a year for climate, report ..."
8,Robo4575,1593401210,self.climate,hhsg4o,3,False,link,2,,Check out these global warming t shirts
9,ballzy94,1593370971,youtube.com,hhk71m,0,False,rich:video,1,,Technologies Protecting our Trees.


In [3]:
df_gw_clean.shape

(3934, 10)

In [4]:
df_gw_clean.isnull().sum()

author             0
created_utc        0
domain             0
id                 0
num_comments       0
over_18            0
post_hint       2828
score              0
selftext        2815
title              0
dtype: int64

Imputation time: imputing the useful columns and dropping the columns that are no longer needed, which also have many missing values.

In [5]:
#For title and selftext columns, I filled the missing values with " " so the two columns can be merged; extra spaces are stripped later.

df_gw_clean["title"] = df_gw_clean["title"].fillna(" ")
df_gw_clean["selftext"] = df_gw_clean["selftext"].fillna(" ")

#Merging the title and selftext for further processing

df_gw_clean['text_merged'] = df_gw_clean['title'] + " " + df_gw_clean['selftext']
df_gw_clean = df_gw_clean.drop(columns = ["title", "selftext"])

#For post_hint, I imputed the missing values with "Empty"
df_gw_clean['post_hint'] = df_gw_clean['post_hint'].fillna("Empty")

Checking the datatypes and also whether all columns have no NaN values:

In [6]:
df_gw_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3934 entries, 0 to 3933
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        3934 non-null   object
 1   created_utc   3934 non-null   int64 
 2   domain        3934 non-null   object
 3   id            3934 non-null   object
 4   num_comments  3934 non-null   int64 
 5   over_18       3934 non-null   bool  
 6   post_hint     3934 non-null   object
 7   score         3934 non-null   int64 
 8   text_merged   3934 non-null   object
dtypes: bool(1), int64(3), object(5)
memory usage: 249.8+ KB


And, checking the final dataframe produced:

In [7]:
df_gw_clean.head()

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged
0,Kafka15,1593554514,i.redd.it,hixbtf,2,False,image,1,Cum
1,karan_negiiiii,1593497051,boringworld.org,hihj6s,0,False,Empty,1,Climate Change in india.
2,Hildavardr,1593479932,self.GlobalWarming,hidb5h,0,False,Empty,1,Global warming and the responsibility of big I...
3,pEppapiGistfuhrer,1593455137,i.redd.it,hi5h41,1,False,image,1,Ayy lets stop global warming
4,BrexitBlaze,1593455005,theguardian.com,hi5feq,2,False,link,2,UK ministers send mixed messages over climate ...


In [8]:
df_gw_clean.loc[15,"text_merged"]

'If we stop corona, we die of gw. If we stop gw. We die of corona [removed]'

#### **Everything looks good here!**
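One thing worth noting in the sample printed above is that it still ends with Reddit's `[removed]` placeholder, which is only stripped later during text cleaning. A minimal sketch of how such rows could be counted is shown here; the `toy` dataframe is a made-up stand-in for `df_gw_clean`, not the real data.

```python
import pandas as pd

# Toy stand-in for df_gw_clean (assumption: the real one comes from the CSV above)
toy = pd.DataFrame({"text_merged": [
    "Climate Change in india.  ",
    "Global warming and big industry [removed]",
    "If we stop corona, we die of gw. [removed]",
]})

# regex=False treats "[removed]" as a literal string, not a character class
has_placeholder = toy["text_merged"].str.contains("[removed]", regex=False)
n_removed = int(has_placeholder.sum())
print(n_removed, "of", len(toy), "posts contain the placeholder")
```

Passing `regex=False` matters here: as a regular expression, `[removed]` would match any single one of the letters inside the brackets.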

### Part 2.2: Importing the saved raw ConspiracyTheory data from the Reddit API and cleaning it

In [9]:
# ConspiracyTheory
file_path = "../datasets/" + "ConspiracyTheory" + "_raw" + ".csv"
df_ct = pd.read_csv(file_path)

# Keeping only a few columns which will be helpful during analysis
df_ct_clean = df_ct[to_keep_clmns].copy()  # copy, so later edits do not act on a slice of df_ct


df_ct_clean.head(2)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,selftext,title
0,LonelyHampster,1581513470,self.ConspiracyTheory,f2r2s5,22,False,,1,Recently me and some of my friends have been n...,"Jimmy Fallon might be in the closet gay, or Bi..."
1,Switchkillengaged,1581312224,self.ConspiracyTheory,f1lp26,6,False,,1,We've always known the president to be a good ...,Theory: Illuminati got Trump in office to crea...


In [10]:
df_ct_clean.shape

(894, 10)

Cleaning the data and getting the text ready for processing:

In [11]:
df_ct_clean.isnull().sum()

author            0
created_utc       0
domain            0
id                0
num_comments      0
over_18           0
post_hint       573
score             0
selftext        542
title             0
dtype: int64

Imputation time: imputing the useful columns and dropping the columns that are no longer needed, which also have many missing values.

In [12]:
#For title and selftext columns, I filled the missing values with " " so the two columns can be merged; extra spaces are stripped later.

df_ct_clean["title"] = df_ct_clean["title"].fillna(" ")
df_ct_clean["selftext"] = df_ct_clean["selftext"].fillna(" ")

#Merging the title and selftext for further processing

df_ct_clean['text_merged'] = df_ct_clean['title'] + " " + df_ct_clean['selftext']
df_ct_clean = df_ct_clean.drop(columns = ["title", "selftext"])

#For post_hint, I imputed the missing values with "Empty"
df_ct_clean['post_hint'] = df_ct_clean['post_hint'].fillna("Empty")

In [13]:
df_ct_clean.loc[0, "text_merged"]

'Jimmy Fallon might be in the closet gay, or Bisexual Recently me and some of my friends have been noticing How Jimmy Fallon looks at men on tv. Him checking out Terry crews when he is shirtless. The way he looks at men and woman similarly. He is ok with cross dressing, and often acts like a preteen girl. He shows many signs on the gaydar. Although it is not for sure. This is a theory that me and my fellow lgbt + friends are in love with because we love Jimmy Fallon and that he is not afraid to be himself. Even if he is straight.'

Checking the final dataframe produced:

In [14]:
df_ct_clean.head()

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged
0,LonelyHampster,1581513470,self.ConspiracyTheory,f2r2s5,22,False,Empty,1,"Jimmy Fallon might be in the closet gay, or Bi..."
1,Switchkillengaged,1581312224,self.ConspiracyTheory,f1lp26,6,False,Empty,1,Theory: Illuminati got Trump in office to crea...
2,makiababi,1581266472,youtu.be,f1b44j,2,False,rich:video,1,NBA Players who are Part of the Secret Illumin...
3,Raven9nine9,1581179235,self.ConspiracyTheory,f0tsly,15,False,Empty,1,Evidence that Suggests Wuhan Market was Not th...
4,finnagains,1581095126,i.redd.it,f0dddo,0,False,image,1,"US Navy Vet - Worked in Afghanistan, Iraq, Sud..."


In [15]:
df_ct_clean.loc[2,"text_merged"]

'NBA Players who are Part of the Secret Illuminati  '

#### **Everything looks good here too!**
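Both subreddits went through the same steps (select columns, fill gaps, merge the text, drop the originals), so they could be wrapped in one function. The sketch below is only an illustration of that idea: `prepare_posts` and the toy `raw` dataframe are hypothetical names that are not used elsewhere in this notebook, and the helper also adds the `subreddit` label for convenience.

```python
import pandas as pd

def prepare_posts(df, columns, subreddit):
    """Keep `columns`, merge title and selftext, and label the source subreddit."""
    out = df[columns].copy()  # copy so the raw dataframe is left untouched
    out["title"] = out["title"].fillna(" ")
    out["selftext"] = out["selftext"].fillna(" ")
    out["text_merged"] = out["title"] + " " + out["selftext"]
    out = out.drop(columns=["title", "selftext"])
    out["post_hint"] = out["post_hint"].fillna("Empty")
    out["subreddit"] = subreddit
    return out

# Toy input with the same kinds of gaps as the raw data
raw = pd.DataFrame({
    "id": ["a1", "b2"],
    "post_hint": ["image", None],
    "title": ["Mmm yes", "Is the earth flat?"],
    "selftext": [None, "Yesterday was the first time"],
    "extra": [0, 1],  # a column that is not kept
})
clean = prepare_posts(raw, ["id", "post_hint", "title", "selftext"], "Toy")
print(clean)
```

With a helper like this, each subreddit would need a single call, which keeps the two cleaning paths from drifting apart.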

One last step is to combine the two dataframes into a single one:

In [16]:
#Adding one column to determine the subreddit pulled from
df_gw_clean["subreddit"] = "GlobalWarming"
df_ct_clean["subreddit"] = "ConspiracyTheory"

df_reddit = pd.concat([df_gw_clean, df_ct_clean], axis = 0, ignore_index=True)


df_reddit.head(5)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged,subreddit
0,Kafka15,1593554514,i.redd.it,hixbtf,2,False,image,1,Cum,GlobalWarming
1,karan_negiiiii,1593497051,boringworld.org,hihj6s,0,False,Empty,1,Climate Change in india.,GlobalWarming
2,Hildavardr,1593479932,self.GlobalWarming,hidb5h,0,False,Empty,1,Global warming and the responsibility of big I...,GlobalWarming
3,pEppapiGistfuhrer,1593455137,i.redd.it,hi5h41,1,False,image,1,Ayy lets stop global warming,GlobalWarming
4,BrexitBlaze,1593455005,theguardian.com,hi5feq,2,False,link,2,UK ministers send mixed messages over climate ...,GlobalWarming


In [17]:
df_reddit.shape

(4828, 10)
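The combined shape matches the two parts (3934 + 894 = 4828 rows), which also means the two subreddits are far from balanced. As a quick early-assessment check, the class shares could be computed as in the sketch below; the `labels` series is rebuilt from the reported row counts rather than taken from `df_reddit`.

```python
import pandas as pd

# Toy labels with the row counts reported above (3934 and 894)
labels = pd.Series(
    ["GlobalWarming"] * 3934 + ["ConspiracyTheory"] * 894, name="subreddit"
)

counts = labels.value_counts()                          # absolute counts per class
shares = labels.value_counts(normalize=True).round(3)   # fraction of all rows
print(counts)
print(shares)
```

On the real data the same check would be `df_reddit["subreddit"].value_counts(normalize=True)`. Roughly 81% of the posts come from GlobalWarming, which is worth keeping in mind when judging later results.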

In [18]:
df_reddit["text_merged"][4548]

"Is the earth flat? Yesterday was the first time that I had actually heard the argument for The Flat Earth Society and I can't say that I'm totally against it. I know it seems a little far fetched at times, but then again I can see where they are coming from. Does anyone here know/have anything they would want to add that might be useful for trying to decided where I stand in all of this? "

### Now, cleaning the text of the merged Reddit dataframe

In [19]:
#Lots of cleaning on text

# Built once, outside the function, so they are not recreated for every row
stops = set(stopwords.words("english"))
p_stemmer = PorterStemmer()

def text_cleaning(item):

    #Replacing "\n" characters with spaces
    item = re.sub("\n", " ", item)
    #Removing the "[removed]" placeholder
    item = item.replace("[removed]", " ")
    #Replacing every non-letter character with a space
    item = re.sub("[^a-zA-Z]", " ", item)
    #Making all characters lower case
    item = item.lower()
    #Removing stopwords (split() also collapses repeated spaces)
    words = [w for w in item.split() if w not in stops]
    #Stemming each word with the Porter stemmer
    words = [p_stemmer.stem(w) for w in words]
    #Joining the stemmed words back into a single string
    return " ".join(words)

df_reddit["text_merged"] = df_reddit["text_merged"].apply(text_cleaning)

df_reddit.reset_index(drop=True, inplace=True)

In [20]:
df_reddit.head(5)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged,subreddit
0,Kafka15,1593554514,i.redd.it,hixbtf,2,False,image,1,cum,GlobalWarming
1,karan_negiiiii,1593497051,boringworld.org,hihj6s,0,False,Empty,1,climat chang india,GlobalWarming
2,Hildavardr,1593479932,self.GlobalWarming,hidb5h,0,False,Empty,1,global warm respons big compani,GlobalWarming
3,pEppapiGistfuhrer,1593455137,i.redd.it,hi5h41,1,False,image,1,ayi let stop global warm,GlobalWarming
4,BrexitBlaze,1593455005,theguardian.com,hi5feq,2,False,link,2,uk minist send mix messag climat commit say fu...,GlobalWarming


In [21]:
df_reddit["text_merged"][50]

'anomal warm temperatur arctic siberia may'
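To make the effect of the regex steps easier to see on their own, here is a small sketch that uses only the standard library and leaves out the NLTK stop-word removal and stemming. The `normalize` function is a hypothetical, cut-down version of `text_cleaning`, and the sample string is made up for the illustration.

```python
import re

def normalize(text):
    """Apply only the character-level steps of the cleaning, in the same order."""
    text = text.replace("\n", " ")            # line breaks become spaces
    text = text.replace("[removed]", " ")     # drop Reddit's placeholder first
    text = re.sub(r"[^a-zA-Z]", " ", text)    # every non-letter becomes a space
    return " ".join(text.lower().split())     # lower-case and collapse spaces

sample = "Extra £14bn needed a year for climate,\nreport says [removed]"
print(normalize(sample))
# extra bn needed a year for climate report says
```

Two side effects are visible: the placeholder has to be removed before the non-letter step, otherwise the word "removed" would survive, and numbers are lost entirely, so "£14bn" is reduced to "bn".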

In [22]:
df_reddit.shape

(4828, 10)

Pickling the dataframe, as it is large!

In [23]:
# The with-block closes the file even if dumping fails
with open('../datasets/df_reddit.pkl', 'wb') as f:
    pickle.dump(df_reddit, f)
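Before relying on the pickle in the next notebook, it can be worth confirming that it loads back unchanged. The sketch below shows one way to do that with pandas' own `to_pickle` and `read_pickle`, writing a small toy dataframe to a temporary folder instead of "../datasets".

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df_reddit
df = pd.DataFrame({"id": ["hixbtf", "f2r2s5"], "score": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_reddit.pkl")
    df.to_pickle(path)                # pandas wrapper around pickle.dump
    restored = pd.read_pickle(path)   # load it back from disk

round_trip_ok = restored.equals(df)   # True when values and dtypes match
print(round_trip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can run arbitrary code.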

