# Cleaning data
Before analysis, non-word content such as URLs, punctuation and emoji is removed from the raw tweets so that the remaining text is easier to interpret.

### Imports
* re - regular expressions for removing unwanted words and characters
* pickle - to load the raw data and save the cleaned data for later
* pandas - to represent the data in a DataFrame
* string - supplies the punctuation characters used in the regular expression
* sklearn - to turn the corpus into a document-term matrix



In [1]:
import re
import pickle
import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
# Load up the pickle dataframe with the raw data
with open('twitterBias/notebooks/pickle/rawDataFrame.p', 'rb') as f:
    df = pickle.load(f)

In [6]:
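def clean_text_round1(text):
    """Lowercase the text, then remove URLs and ASCII punctuation."""
    text = text.lower()
    # r"http\S+" already matches https URLs, so a separate https pattern is not needed
    text = re.sub(r"http\S+", " ", text)
    # string.punctuation is ASCII only, so curly quotes such as ’ are left in place
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    #text = re.sub(r'\w*\d\w*', '', text)
    return text
round1 = clean_text_round1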
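
To make the effect of the first round concrete, here is a minimal sketch that applies the same three steps (lowercase, strip URLs, strip ASCII punctuation) to a single string. The sample tweet is made up for illustration and is not taken from the dataset.

```python
import re
import string


def clean_text_round1(text):
    """Lowercase, strip URLs, then strip ASCII punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return text


# Hypothetical tweet, invented for this example
sample = "Check THIS out: https://example.com/a?b=1 @user #nfl i’m in!"
cleaned = clean_text_round1(sample)
print(cleaned.split())
```

Two side effects are worth noting: the `@` and `#` markers are stripped, so handles and hashtags merge into ordinary tokens, and the curly apostrophe in `i’m` survives because `string.punctuation` only contains ASCII characters.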

In [4]:
def clean_text_round2(text):
    EMOJI_PATTERN = re.compile(
        "["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"  # enclosed characters; this wide range also covers CJK and other scripts
        "]+")
    # Replace each run of matched characters with a single space
    text = EMOJI_PATTERN.sub(' ', text)
    return text
round2 = clean_text_round2
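
The last range in the pattern, `\U000024C2-\U0001F251`, is much wider than the others. The sketch below uses a reduced pattern and a made-up sample string to show that it removes CJK text along with emoji. Whether that matters for this dataset is an open question; it only shows what the range matches.

```python
import re

# Reduced versions of the pattern, for illustration only
WITH_WIDE_RANGE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U000024C2-\U0001F251"  # wide range from the original pattern
    "]+"
)
WITHOUT_WIDE_RANGE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "]+"
)

sample = "touchdown 🏈😀 日本 ok"  # hypothetical text, not from the dataset
wide = WITH_WIDE_RANGE.sub(" ", sample)
narrow = WITHOUT_WIDE_RANGE.sub(" ", sample)
print(wide.split())
print(narrow.split())
```

With the wide range the CJK word is dropped; without it only the emoji are removed.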

In [7]:
data_clean = pd.DataFrame(df.tweets.apply(round1))
data_clean = pd.DataFrame(data_clean.tweets.apply(round2))

# Save the clean corpus to a pickle file
with open('twitterBias/notebooks/pickle/corpus.p', 'wb') as f:
    pickle.dump(data_clean, f)

In [8]:
print(data_clean.tweets[0])


yobeccz even when i think i’m eating sugar  mfp says i ate sugarthe best receiver in the nfl   kim’s convenience after a football sunday that couldn’t have gone better   jagster7985 can i chop it up with you guys i’m chomping at the bit waiting for my cs4 review code trevarnold ikepackers especially in a system like matt lafleur’s where it’s hard for a rookie to pick it up  the best draft pick on the field today might’ve even been josiah deguara  nobody called thattrevarnold ikepackers can’t look back at that now remember we picked 3 receivers in the draft two years ago 2 are still on the team the time to rate that draft is after this year  they’re not meant to be judged immediatelytrevarnold ikepackers the achilles heel aside from the poor run defense  aaron rodgers not playing like the bad man that he iseven san francisco folded because of talent deficiency at qb once mahomes started rollingtrevarnold ikepackers sf did that to min too last year  and they supposedly had a good run def

In [9]:

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.tweets)
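# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())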
data_dtm.index = data_clean.index
#data_dtm=data_dtm.transpose()

data_dtm

,000,000we,003,010,02,0400,078,09,0g,0n,...,óscar,óscarsimongbrooks,öyster,última,þórisson,петер,термéн,საქართველო,ᵃʳᵉ,ᵗʰᵉʸ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
386,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
387,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
388,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
389,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
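
To illustrate what the matrix above contains, here is a standard-library sketch that builds a tiny document-term matrix by hand: one row per document, one column per term, each cell a count. It is not how `CountVectorizer` is implemented. The two documents and the three-word stop list are assumptions made for the example, and the token pattern mirrors `CountVectorizer`'s default of keeping tokens with two or more word characters.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; the real corpus is data_clean.tweets
docs = [
    "the best receiver in the nfl",
    "best draft pick on the field",
]

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters
STOP_WORDS = {"the", "in", "on"}   # tiny illustrative list, not scikit-learn's

# Count the remaining tokens in each document
counts = [
    Counter(tok for tok in TOKEN.findall(doc) if tok not in STOP_WORDS)
    for doc in docs
]

# Columns are the sorted vocabulary; rows are per-document counts
vocab = sorted(set().union(*counts))
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this toy case, which is why the real matrix, with a far larger vocabulary, shows only zeros in the displayed corner.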


In [13]:
# pickle the data for later
data_dtm.to_pickle("twitterBias/notebooks/pickle/dtm.pkl")

# Final results
The data has been converted to a corpus and a document-term matrix, which should be all that is needed to analyse it for anything meaningful.

Further cleaning steps may be needed if the analysis reveals any anomalies.
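
One possible way to look for such anomalies is to scan the vocabulary for terms that contain digits or non-ASCII characters, since the matrix columns include entries such as `000we` and `ᵃʳᵉ`. The helper below is a hypothetical sketch, not part of the notebook, and the term list is a hand-picked subset of the column names shown earlier.

```python
def flag_suspect_terms(terms):
    """Split out terms containing digits and terms containing non-ASCII characters."""
    with_digits = [t for t in terms if any(ch.isdigit() for ch in t)]
    non_ascii = [t for t in terms if not t.isascii()]
    return with_digits, non_ascii


# A few column names copied from the matrix above; in the notebook this
# could be list(data_dtm.columns)
terms = ["000we", "0g", "óscar", "receiver", "ᵃʳᵉ"]
with_digits, non_ascii = flag_suspect_terms(terms)
print(with_digits)
print(non_ascii)
```

A flagged term is not automatically noise: `óscar` is a legitimate name. The lists are meant for manual review. If digit-bearing tokens turn out to be unwanted, the commented-out substitution in `clean_text_round1` looks intended to remove them.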