# Data Cleaning

## Introduction

###### Data Gathering: define scope

- where are we going to get the data?
- how much? 
- which people? 
- what time range? 

###### Data Cleaning:
- Clean: remove punctuation and numbers, and lowercase the letters, using __re__, Python's library for regular expressions.
- Tokenize: split the text into smaller parts. The most common token size is a word. After filtering out stop words such as 'the' or 'a', we end up with a bag of words.
- Put the result in a matrix that a machine can read.
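
The first two steps above can be sketched on a single sentence. This is only an illustration, not the code used later in this notebook: the `bag_of_words` helper and the tiny `STOP_WORDS` set are assumptions made for the example, and real stop-word lists are much longer.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "is", "in", "of", "to", "and"}

def bag_of_words(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words, count."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\d+", "", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return Counter(tokens)

sample = "The virus is spreading in 2020, and the news is everywhere!"
counts = bag_of_words(sample)
print(counts)
```

Word order is lost in the result, which is why it is called a "bag" of words.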


Format 1: __corpus = a collection of texts__ <br>
    pandas: Python library for data analysis <br>
    DataFrame: a pandas object that holds a table


Format 2: __document-term matrix__ <br>
    scikit-learn: Python library for machine learning <br>
    CountVectorizer: builds the matrix of word counts



## Problem Statement

As a reminder, our goal is to look at tweets from various users and note their feelings towards the pandemic over the 7 months from January to July 2020.

## Getting The Data

To get the data, I used GetOldTweets3, a Python 3 library and corresponding command-line utility for accessing old tweets.
I collected up to 500 English-language tweets containing the word "coronavirus" for each month from January to July 2020.

In [85]:
# Importing GetOldTweets3
import GetOldTweets3 as got
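def get_tweets(startdate):
    '''Return the text of up to 500 English tweets that mention "coronavirus" in the month starting at startdate.'''
    # The window ends on the first day of the next month. Reading a single month digit
    # is enough for the January-July start dates used here (it would break from September on).
    next_month = int(startdate[6:7]) + 1
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch("coronavirus")\
                                            .setSince(startdate)\
                                            .setUntil("2020-0" + str(next_month) + "-01")\
                                            .setMaxTweets(500)\
                                            .setLang('en')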
    tweet = got.manager.TweetManager.getTweets(tweetCriteria)
    
    text_tweets = [tw.text for tw in tweet]
#     df_state= pd.DataFrame(text_tweets, columns = ['User', 'Text', 'Date', 'Hashtags'])
    
    return text_tweets
dates = ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01"]
months = ["january", "february", "march", "april", "may", "june", "july"]
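
The end date in `get_tweets` is built by string slicing, which only works for single-digit months. A minimal sketch of a more general alternative, using only the standard library, is shown here. The `month_window` helper is hypothetical and assumes the start date is the first day of a month in `YYYY-MM-DD` format.

```python
from datetime import date

def month_window(start):
    """Return (since, until) ISO strings: `start` and the first day of the following month."""
    since = date.fromisoformat(start)
    if since.month == 12:
        # December rolls over into January of the next year
        until = date(since.year + 1, 1, 1)
    else:
        until = date(since.year, since.month + 1, 1)
    return since.isoformat(), until.isoformat()

print(month_window("2020-07-01"))
print(month_window("2020-12-01"))
```

The two returned strings could be passed to `setSince` and `setUntil` in place of the hand-built date.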

In [86]:
# # # Actually request text (takes a few minutes to run)
# tweets = [get_tweets(u) for u in dates]
# print(tweets)

In [87]:
# # Pickle files for later use
# import pickle
# # Make a new directory to hold the text files
# !mkdir tweets

# for i, m in enumerate(months):
#     with open("tweets/" + m + ".txt", "wb") as file:
#         pickle.dump(tweets[i], file)

In [88]:
# Load pickled files
import pickle

data = {}
for i, m in enumerate(months):
    with open("tweets/" + m + ".txt", "rb") as file:
        data[m] = pickle.load(file)

In [89]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['january', 'february', 'march', 'april', 'may', 'june', 'july'])

In [90]:
# More checks
data['april']

["Calls for private health sector to hand back 'very substantial unexpected profit' during coronavirus crisis ",
 'Supercharge #nonprofit coronavirus fundraising with 4 easy steps. https://blog.candid.org/post/supercharge-nonprofit-coronavirus-fundraising-with-4-easy-steps/ via @CandidDotOrg',
 'Why are you all so late to the @BorisJohnson is an egomaniac, lazy, drunkard, sociopathic, feckless bastard party? Ive put the dishwasher on, hoovered up and hoping that your not crashing on my settee. #BorisJohnson #COVID19 #coronavirus https://twitter.com/mrjamesob/status/1255902627356966915',
 '“Like OMG, I cant believe I got coronavirus!” -these beach morons next week ',
 'Apple sales inch higher despite coronavirus but CEO Tim Cook sees uncertain future ',
 'New York reportedly paid $69 million for ventilators to an engineer with no background in medical supplies at the recommendation of the White House coronavirus task force ',
 'This is just wrong! China asked overseas Chinese to help st

## Cleaning The Data

**Cleaning steps that I used:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
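
As an illustration of one of these later steps, bi-grams and tri-grams can be built from a token list with plain Python. This is a sketch only: the `ngrams` helper is a hypothetical name, and this notebook does not apply it to the tweets.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    # zip over n shifted copies of the token list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "stay home and stay safe".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```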

In [91]:
# Let's take a look at our data again
next(iter(data.keys()))

'january'

In [92]:
# Notice that our dictionary is currently in key: month, value: list of text format
next(iter(data.values()))

['Unless Corona virus cancels the league #Dreamcatcher',
 'I told a few dumb liberals that the Corona virus was coming from Mexico. Now they want to build the wall.',
 'Many travel companies are limiting their operations to reduce the impact of the coronavirus outbreak on their financials while helping to safeguard public health. ',
 '“The fatality rate [of the coronavirus] is 200/10,000, which is currently lower compared to many other viruses including SARS, so if it was meant as a bioweapon, it is not a good one." ',
 'US bars entry to foreigners who traveled to China http://bit.ly/2Oj0amd #coronavirus',
 "Why B.C.'s top doctor doesn't want to impose a coronavirus quarantine | CTV News ",
 'Corona Virus Outbreak Linked To Animal Meat Market That Sold Live Rats, Snakes, And Wolf Cubs ',
 'this coronavirus got to me, say your farewells now ',
 'People who come on here to fear monger and be racist re coronavirus couldnt even describe its understand the mechanism of ssRNA viruses and it 

In [93]:
# We are going to change this to key: month, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [94]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [95]:
# We can either keep it in dictionary format or put it into a pandas dataframe
# import pandas as pd
# text_tweets = [[tw.username,
#             tw.text,
#             tw.date,
#             tw.hashtags
#           ] for tw in tweets]
# data_df= pd.DataFrame(text_tweets, columns = ['User', 'Text', 'Date', 'Hashtags'])
# data_df
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['text']
data_df = data_df.sort_index()
data_df

Unnamed: 0,text
april,Calls for private health sector to hand back 'very substantial unexpected profit' during coronavirus crisis Supercharge #nonprofit coronavirus fu...
february,I think this Coronavirus outbreak has you overthinking things. What a blatantly #racist tweet. How many cases of #coronavirus in Mexico or Canada...
january,Unless Corona virus cancels the league #Dreamcatcher I told a few dumb liberals that the Corona virus was coming from Mexico. Now they want to bui...
july,"FYI... Florida Surpasses New York In Coronavirus Case Tally As US Set To See Deaths Top 1,000 For 5th Day The weekend is getting off to a rough st..."
june,#Sport #Care Access continue to plan and develop projects assisting people to move forward in their lives. With devastating and horrific effects o...
march,These are NOT Christians. These are demons willing to risk the lives of the vulnerable who give these frauds attention. This man should be arreste...
may,"Coronavirus &amp; Your Indoor Air: Are You At Risk At Home? This is fucking scary, putting your,ice at risk to support a cause. Now both cops and..."
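
The dictionary-to-DataFrame step can also be written without the transpose. The sketch below uses a made-up two-month dictionary (`toy_combined` is an assumption for the example) with the same shape as `data_combined`: each key maps to a one-element list holding the combined text.

```python
import pandas as pd

# Toy stand-in for data_combined: month -> [one combined string]
toy_combined = {
    "january": ["first tweet second tweet"],
    "february": ["another tweet"],
}

# orient="index" turns each key into a row label, so no transpose is needed
toy_df = pd.DataFrame.from_dict(toy_combined, orient="index", columns=["text"]).sort_index()
print(toy_df)
```

As in the cell above, `sort_index()` orders the rows alphabetically by month name, not chronologically.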


In [96]:
# Let's take a look at the text for may
data_df.text.loc['may']



In [97]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)

    return text

round1 = clean_text_round1
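
`string.punctuation` only covers ASCII punctuation, so the curly quotes that appear in some tweets survive this round, and each removal leaves extra spaces behind. A possible follow-up step is sketched here. It is an assumption for illustration, not part of the cleaning this notebook performs, and `tidy_text` is a hypothetical helper.

```python
import re

def tidy_text(text):
    """Hypothetical extra step: drop common non-ASCII quote marks and collapse whitespace."""
    text = re.sub(r"[“”‘’…]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

tidied = tidy_text("“like omg  i cant believe   it” ")
print(tidied)
```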

In [98]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.text.apply(round1))
data_clean.text.loc['may']



In [99]:
# Apply a second round of cleaning
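def clean_text_round2(text):
    '''Remove the search keyword and other very frequent, uninformative terms.'''
    # Note: the pattern has no word boundaries and tries 'case' before 'cases',
    # so it also cuts inside longer words and leaves a stray 's' from 'cases'
    text = re.sub(r'corona|virus|coronavirus|covid|case|cases|new|pandemic', '', text)
    return text

round2 = clean_text_round2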
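
The pattern in `clean_text_round2` matches anywhere in the text, so "cases" becomes "s" and "news" loses its first three letters. A word-boundary variant is sketched below for comparison. It is an alternative under the assumption that only whole words should be removed, not the version that produced the outputs in this notebook, and `remove_keywords` is a hypothetical name.

```python
import re

KEYWORDS = ["coronavirus", "corona", "virus", "covid", "cases", "case", "new", "pandemic"]

# \b on both sides restricts each alternative to whole words
KEYWORD_RE = re.compile(r"\b(?:%s)\b" % "|".join(KEYWORDS))

def remove_keywords(text):
    """Remove the listed keywords only when they appear as complete words."""
    return KEYWORD_RE.sub("", text)

sample = "how many cases of coronavirus in the news"
whole_words_only = remove_keywords(sample)
original_pattern = re.sub(r"corona|virus|coronavirus|covid|case|cases|new|pandemic", "", sample)
print(whole_words_only.split())
print(original_pattern.split())
```

Comparing the two printed lists shows which fragments the unanchored pattern leaves behind.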

In [100]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.text.apply(round2))
data_clean.text

april       calls for private health sector to hand back very substantial unexpected profit during  crisis  supercharge nonprofit  fundraising with  easy step...
february    i think this  outbreak has you overthinking things  what a blatantly racist tweet how many s of  in mexico or canada  as hotel occupancy plunges t...
january     unless   cancels the league dreamcatcher i told a few dumb liberals that the   was coming from mexico now they want to build the wall many travel ...
july        fyi florida surpasses  york in   tally as us set to see deaths top  for  day the weekend is getting off to a rough start   live updates nigeria so...
june        sport care access continue to plan and develop projects assisting people to move forward in their lives with devastating and horrific effects of  ...
march       these are not christians these are demons willing to risk the lives of the vulnerable who give these frauds attention this man should be arrested ...
may          amp your indoor

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of texts
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [101]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,text
april,Calls for private health sector to hand back 'very substantial unexpected profit' during coronavirus crisis Supercharge #nonprofit coronavirus fu...
february,I think this Coronavirus outbreak has you overthinking things. What a blatantly #racist tweet. How many cases of #coronavirus in Mexico or Canada...
january,Unless Corona virus cancels the league #Dreamcatcher I told a few dumb liberals that the Corona virus was coming from Mexico. Now they want to bui...
july,"FYI... Florida Surpasses New York In Coronavirus Case Tally As US Set To See Deaths Top 1,000 For 5th Day The weekend is getting off to a rough st..."
june,#Sport #Care Access continue to plan and develop projects assisting people to move forward in their lives. With devastating and horrific effects o...
march,These are NOT Christians. These are demons willing to risk the lives of the vulnerable who give these frauds attention. This man should be arreste...
may,"Coronavirus &amp; Your Indoor Air: Are You At Risk At Home? This is fucking scary, putting your,ice at risk to support a cause. Now both cops and..."


In [102]:
# Let's add the month names as a column as well
# Reusing the index avoids hand-typing the names in the same alphabetical order as the rows
data_df['month'] = data_df.index
data_df

Unnamed: 0,text,month
april,Calls for private health sector to hand back 'very substantial unexpected profit' during coronavirus crisis Supercharge #nonprofit coronavirus fu...,april
february,I think this Coronavirus outbreak has you overthinking things. What a blatantly #racist tweet. How many cases of #coronavirus in Mexico or Canada...,february
january,Unless Corona virus cancels the league #Dreamcatcher I told a few dumb liberals that the Corona virus was coming from Mexico. Now they want to bui...,january
july,"FYI... Florida Surpasses New York In Coronavirus Case Tally As US Set To See Deaths Top 1,000 For 5th Day The weekend is getting off to a rough st...",july
june,#Sport #Care Access continue to plan and develop projects assisting people to move forward in their lives. With devastating and horrific effects o...,june
march,These are NOT Christians. These are demons willing to risk the lives of the vulnerable who give these frauds attention. This man should be arreste...,march
may,"Coronavirus &amp; Your Indoor Air: Are You At Risk At Home? This is fucking scary, putting your,ice at risk to support a cause. Now both cops and...",may


In [103]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

In [104]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aag,aarefajohari,aayega,abandoned,abandons,abattoirs,abbott,abbotts,abc,abcs,...,來自,武汉肺炎,고양이,고양이스타그램,궈낙아쉥일축하훼,깍두기,제가,페르시아,𝗔𝘂𝘁𝗵𝗼𝗿𝘀,𝗧𝗶𝘁𝗹𝗲
april,0,0,1,3,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
february,0,0,0,0,0,0,0,0,0,1,...,1,1,0,0,0,1,1,1,0,0
january,0,0,0,0,0,0,0,0,2,1,...,0,0,0,0,0,0,0,0,0,0
july,0,0,0,0,0,1,0,0,5,0,...,0,0,0,0,1,0,0,0,0,0
june,1,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,1
march,0,0,0,0,1,0,0,1,0,0,...,0,0,1,1,0,0,0,0,0,0
may,0,1,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0


In [105]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [106]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
# Use a context manager so the file is closed once the object has been written
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)
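
A later notebook can read these files back with `pd.read_pickle`. The round trip is sketched here on a toy DataFrame written to a temporary directory, so the file name `toy_corpus.pkl` and the data are assumptions for the example. Pickle files should only be loaded from sources you trust.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"text": ["first document", "second document"]}, index=["a", "b"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_corpus.pkl")
    toy.to_pickle(path)              # same call used for corpus.pkl and dtm.pkl
    restored = pd.read_pickle(path)  # loads the DataFrame back, index included

print(restored.equals(toy))
```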