# Text manipulation

Hello everyone! In this section, we will learn how to manipulate text data using `TextBlob` and `scikit-learn`. In particular, we will use these packages to clean and format our text data, and then transform it into simpler text and vector representations.

In [65]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from textblob import TextBlob as tb
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
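# Note: this rebinds the name `stopwords` from the NLTK corpus module to a set of English stop words
stopwords = set(stopwords.words('english'))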

In [66]:
# Read our tweets from the previously created CSV
tweets = pd.read_csv('out/tweets.csv', index_col=None, header=0)
tweets.head()

Unnamed: 0,id,handle,created_at,text
0,1140836172530274304,ConnDiandra,2019-06-18 04:18:33,RT @SaraCarterDC: #TheSaraCarterShow: What's N...
1,1140836171112747008,lillerik,2019-06-18 04:18:33,RT @viticci: Never seen this alert before – Ap...
2,1140836170051534848,bae_hon,2019-06-18 04:18:33,RT @Shazam: We love @OfficialMonstaX &amp; @Fr...
3,1140836169279631360,megandurazo,2019-06-18 04:18:33,RT @speriod: we need a new fiona apple album
4,1140836169174749184,Nahirk,2019-06-18 04:18:33,"RT @ij_baird: Help make FaceTime awesome, by a..."


### Text cleaning
When cleaning our data, we want to remove unnecessary characters such as punctuation and extra whitespace, so that we can focus solely on the terms found in the text.

In [67]:
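def clean_tweets(tweets):
    """
    Fill missing tweets with empty strings, lower-case the text, remove
    mentions, URLs, special characters, digits, and "rt", strip leading
    and trailing whitespace, and remove stop words.
    Adds a 'cleaned_text' column to the DataFrame in place and returns it.
    """
    tweets['cleaned_text'] = tweets['text'].fillna('')
    tweets['cleaned_text'] = tweets['cleaned_text'].str.lower()
    # regex=True is needed on recent pandas versions, where str.replace
    # treats the pattern as a literal string by default.
    # Note: the bare "rt" alternative also removes "rt" inside words
    # (e.g. "alert" becomes "ale"), as the output below shows.
    tweets['cleaned_text'] = tweets['cleaned_text'].str.replace(
        r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|rt|\d+', '', regex=True)
    tweets['cleaned_text'] = tweets['cleaned_text'].str.replace(r'^\s+|\s+$', '', regex=True)
    tweets['cleaned_text'] = tweets['cleaned_text'].apply(lambda x: ' '.join([w for w in x.split() if w not in (stopwords)]))
    return tweets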

In [68]:
# Clean tweets
cleaned_tweets = clean_tweets(tweets)
cleaned_tweets.head()

Unnamed: 0,id,handle,created_at,text,cleaned_text
0,1140836172530274304,ConnDiandra,2019-06-18 04:18:33,RT @SaraCarterDC: #TheSaraCarterShow: What's N...,thesaracaershow whats next russiainvestigation...
1,1140836171112747008,lillerik,2019-06-18 04:18:33,RT @viticci: Never seen this alert before – Ap...,never seen ale apple tells app youre deleting ...
2,1140836170051534848,bae_hon,2019-06-18 04:18:33,RT @Shazam: We love @OfficialMonstaX &amp; @Fr...,love amp whodoylove
3,1140836169279631360,megandurazo,2019-06-18 04:18:33,RT @speriod: we need a new fiona apple album,need new fiona apple album
4,1140836169174749184,Nahirk,2019-06-18 04:18:33,"RT @ij_baird: Help make FaceTime awesome, by a...",baird help make facetime awesome applying swee...
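
The cleaned output above shows that the `rt` alternative in the pattern also removes "rt" from inside words ("alert" becomes "ale"). As a minimal sketch of one possible variation, not the notebook's own method, the retweet marker can be matched only as a whole word with `\brt\b`. The `clean_text` helper, the `NOISE` pattern, and the tiny `STOP_WORDS` set below are hypothetical names made up for this illustration (the notebook uses NLTK's English list), and `@\w+` here also removes underscores in handles, which differs slightly from the original pattern.

```python
import re

# Hypothetical, tiny stop word set used only for this illustration;
# the notebook itself uses NLTK's English stop word list.
STOP_WORDS = {"we", "a", "the", "this", "by"}

# Mentions, URLs, the whole word "rt", digits, and any remaining
# character that is not a lower-case letter, digit, space, or tab.
NOISE = re.compile(r"@\w+|\w+://\S+|\brt\b|\d+|[^0-9a-z \t]")

def clean_text(text):
    """Lower-case a single tweet, strip noise, and drop stop words."""
    text = text.lower()
    text = NOISE.sub("", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("RT @speriod: we need a new fiona apple album"))
print(clean_text("Never seen this alert before"))
```

With the word boundary in place, "alert" is kept intact while a leading "RT" is still removed.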


In [69]:
# Export the cleaned tweets into CSV
cleaned_tweets.to_csv('out/cleaned_tweets.csv', index=False)

### Text representation
We also want to transform our data from terms into numeric form, so that we can apply quantitative techniques. We will use three representations:

1. **Document-term matrix**: occurrence of words across documents
2. **N-gram matrix**: occurrence of n-grams (phrases of n words) across documents
3. **TFIDF matrix**: term frequency adjusted by the rarity of the term across documents


In [70]:
def tweets_to_dtm(tweets):
    tweets = tweets['cleaned_text']
    vectorizer = CountVectorizer(max_features=2000)
    dtm = vectorizer.fit_transform(tweets)
    # Save the fitted vectorizer; the with-block closes the file afterwards
    with open('out/dtm.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return dtm, vectorizer

def tweets_to_ngram(tweets, n=2):
    tweets = tweets['cleaned_text']
    vectorizer = CountVectorizer(
        ngram_range=(n, n),
        token_pattern=r'\b\w+\b',
        min_df=1,
        max_features=2000)
    dtm = vectorizer.fit_transform(tweets)
    with open('out/ngram.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return dtm, vectorizer

def tweets_to_tfidf(tweets):
    tweets = tweets['cleaned_text']
    vectorizer = TfidfVectorizer(max_features=2000)
    tfidf = vectorizer.fit_transform(tweets)
    with open('out/tfidf.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return tfidf, vectorizer
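
The functions above pickle each fitted vectorizer. As a rough sketch of why that is useful, a saved vectorizer can later be loaded and applied to new text so that it is mapped onto the same vocabulary. The example below is self-contained: it fits a toy vectorizer on made-up documents and writes it to a temporary directory rather than to the notebook's `out/` folder, so the file path and the `docs` list are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
docs = ["need new apple album", "love new apple phone"]
vectorizer = CountVectorizer(max_features=2000)
vectorizer.fit(docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dtm.pk")
    # Save the fitted vectorizer
    with open(path, "wb") as f:
        pickle.dump(vectorizer, f)
    # Load it back, as a later notebook or script might do
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# transform() reuses the fitted vocabulary; unseen words are ignored
new_dtm = loaded.transform(["new apple watch"])
print(new_dtm.shape)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.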

In [71]:
# Get document-term matrix
dtm, dtm_v = tweets_to_dtm(cleaned_tweets)
print('DTM shape:', dtm.shape)
list(dtm_v.vocabulary_.items())[0:5]

DTM shape: (1000, 2000)


[('thesaracaershow', 1692),
 ('whats', 1908),
 ('next', 968),
 ('russiainvestigation', 1353),
 ('st', 1551)]

In [72]:
# Get ngram matrix
ngram, ngram_v = tweets_to_ngram(cleaned_tweets, n=2)
print('Ngram matrix shape:', ngram.shape)
list(ngram_v.vocabulary_.items())[0:5]

Ngram matrix shape: (1000, 2000)


[('thesaracaershow whats', 1514),
 ('whats next', 1873),
 ('next russiainvestigation', 681),
 ('russiainvestigation st', 1218),
 ('st podcast', 1391)]

In [73]:
# Get TFIDF matrix
tfidf, tfidf_v = tweets_to_tfidf(cleaned_tweets)
print('TFIDF matrix shape:', tfidf.shape)
list(tfidf_v.vocabulary_.items())[0:5]

TFIDF matrix shape: (1000, 2000)


[('thesaracaershow', 1692),
 ('whats', 1908),
 ('next', 968),
 ('russiainvestigation', 1353),
 ('st', 1551)]

### Term frequencies
We can convert our text matrices back into a list of terms and their accompanying frequencies.

In [74]:
def vector_to_frequency(vector, vectorizer):
    """
    Return a list of words and their corresponding occurrence in the corpus
    """
    total = vector.sum(axis=0)
    frequency = [(w, total[0, i]) for w, i in vectorizer.vocabulary_.items()]
    frequency = pd.DataFrame(frequency, columns=['term', 'frequency'])
    frequency = frequency.sort_values(by='frequency', ascending=False).reset_index(drop=True)
    return frequency

In [75]:
freq_dtm = vector_to_frequency(dtm, dtm_v)
freq_dtm.to_csv('out/frequency_dtm.csv', index=False)
freq_dtm.head()

Unnamed: 0,term,frequency
0,apple,319
1,amp,209
2,love,175
3,whodoylove,163
4,podcast,93


In [76]:
freq_ngram = vector_to_frequency(ngram, ngram_v)
freq_ngram.to_csv('out/frequency_ngram.csv', index=False)
freq_ngram.head()

Unnamed: 0,term,frequency
0,kirk show,162
1,absolute madness,162
2,phones sucks,67
3,need food,67
4,taco bell,67


In [77]:
freq_tfidf = vector_to_frequency(tfidf, tfidf_v)
freq_tfidf.to_csv('out/frequency_tfidf.csv', index=False)
freq_tfidf.head()

Unnamed: 0,term,frequency
0,love,97.771467
1,whodoylove,97.085497
2,amp,95.66247
3,apple,50.661257
4,podcast,25.575891
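
Note that for the TFIDF matrix the `frequency` column holds summed weights rather than counts, which is why its values are fractional and why the ranking differs from the document-term matrix. As a rough guide, and assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), which should apply here since only `max_features` is set, the weight of term $t$ in tweet $d$ is:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{d,t} = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \left(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\right)^{2}}}
$$

Here $n$ is the number of tweets, $\mathrm{df}(t)$ is the number of tweets containing $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in tweet $d$. The value reported for each term above is then $\sum_{d} w_{d,t}$, the sum of its weights over all tweets.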
