# Text manipulation

Hello everyone! In this section, we will learn how to manipulate text data using `TextBlob` and `scikit-learn`. In particular, we will use these packages to clean and format our text data, and then transform it into simpler text and vector representations.

In [1]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from textblob import TextBlob as tb
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Note: this rebinds the imported `stopwords` name to the English stopword set
stopwords = set(stopwords.words('english'))

In [2]:
# Read our tweets from the previously created CSV
tweets = pd.read_csv('out/tweets.csv', index_col=None, header=0)
tweets.head()

Unnamed: 0,id,handle,created_at,text
0,1141553866938261505,DoodleSpooks,2019-06-20 03:50:25,"RT @BirdKeeperToby: The Galar region, new Poke..."
1,1141552242912309248,EZDBud,2019-06-20 03:43:58,@JammerHighwind We needed more wooloo in our l...
2,1141551914066333696,Jay_Jitters,2019-06-20 03:42:39,RT @Xephia: A wooloo cloud ☁️🌱 https://t.co/WS...
3,1141551873297649669,toboldlylaura,2019-06-20 03:42:30,Friend: I can’t sleep :(\r\nMe: want to count ...
4,1141551868163829760,Jay_Jitters,2019-06-20 03:42:29,RT @Phoelion: Woohoo! It’s Wooloo!! 💖 https://...


### Text cleaning
When cleaning our data, we want to remove unnecessary characters such as punctuation and extra whitespace, so that we can focus solely on the terms found in the text.

In [3]:
def clean_tweets(tweets):
    """
    Fill in missing tweets, convert text to lower case, remove mentions,
    special characters, URLs, RT markers, and digits, strip leading and
    trailing whitespace, and remove stopwords.

    Adds a `cleaned_text` column to `tweets` in place and returns it.
    """
    # Mentions | non-alphanumeric characters | URLs | 'rt' | digits.
    # Note: the bare `rt` alternative also removes "rt" inside longer words.
    pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|rt|\d+'
    tweets['cleaned_text'] = tweets['text'].fillna('')
    tweets['cleaned_text'] = tweets['cleaned_text'].str.lower()
    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    tweets['cleaned_text'] = tweets['cleaned_text'].str.replace(pattern, '', regex=True)
    tweets['cleaned_text'] = tweets['cleaned_text'].str.strip()
    tweets['cleaned_text'] = tweets['cleaned_text'].apply(
        lambda x: ' '.join(w for w in x.split() if w not in stopwords))
    return tweets
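
The cleaning pattern is easier to follow on a single string. The sketch below applies the same steps with the standard `re` module to one made-up tweet. The `clean_text` helper, the sample text, and the tiny stopword set are hypothetical and not part of the notebook. It also compares the notebook's pattern with a variant that wraps `rt` in word boundaries, which is our assumption about one way to keep words such as "start" intact.

```python
import re

# The pattern used in clean_tweets, with a bare `rt` alternative.
ORIGINAL = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|rt|\d+'
# Hypothetical variant: word boundaries restrict `rt` to the standalone marker.
BOUNDED = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|\brt\b|\d+'

def clean_text(text, pattern, stop_words=frozenset({'we', 'to'})):
    """Lower-case the text, delete every pattern match, and drop stop words."""
    text = re.sub(pattern, '', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = 'RT @Xephia: We start to count 3 Wooloo! https://t.co/abc'
original_result = clean_text(sample, ORIGINAL)
bounded_result = clean_text(sample, BOUNDED)
print(original_result)  # sta count wooloo
print(bounded_result)   # start count wooloo
```

Switching to the bounded variant would change the cleaned text and therefore the counts shown later in this section, so treat it as an option to evaluate rather than a drop-in fix.
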

In [4]:
# Clean tweets
cleaned_tweets = clean_tweets(tweets)
cleaned_tweets.head()

Unnamed: 0,id,handle,created_at,text,cleaned_text
0,1141553866938261505,DoodleSpooks,2019-06-20 03:50:25,"RT @BirdKeeperToby: The Galar region, new Poke...",galar region new pokemon dreadnaw wooloo grook...
1,1141552242912309248,EZDBud,2019-06-20 03:43:58,@JammerHighwind We needed more wooloo in our l...,needed wooloo lives
2,1141551914066333696,Jay_Jitters,2019-06-20 03:42:39,RT @Xephia: A wooloo cloud ☁️🌱 https://t.co/WS...,wooloo cloud
3,1141551873297649669,toboldlylaura,2019-06-20 03:42:30,Friend: I can’t sleep :(\r\nMe: want to count ...,friend cant sleep want count sheepme proceeds ...
4,1141551868163829760,Jay_Jitters,2019-06-20 03:42:29,RT @Phoelion: Woohoo! It’s Wooloo!! 💖 https://...,woohoo wooloo


In [5]:
# Export the cleaned tweets into CSV
cleaned_tweets.to_csv('out/cleaned_tweets.csv', index=False)
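
One thing to watch for when reading this file back later: tweets whose cleaned text is empty are written as empty fields, and `pd.read_csv` loads empty fields as `NaN` by default. The sketch below shows the round trip on a made-up two-row frame, using an in-memory buffer in place of `out/cleaned_tweets.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned tweets; the second text is empty.
df = pd.DataFrame({'id': [1, 2], 'cleaned_text': ['wooloo cloud', '']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
n_missing = int(restored['cleaned_text'].isna().sum())
print(n_missing)  # 1: the empty string came back as NaN

# Restore empty strings before passing the column to a vectorizer.
restored['cleaned_text'] = restored['cleaned_text'].fillna('')
```
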

### Text representation
We also want to transform our data from terms into numbers, so that we can apply quantitative techniques. We will build three representations:

1. **Document-term matrix**: occurrence of words across documents
2. **N-gram matrix**: occurrence of n-grams (phrases of n words) across documents
3. **TFIDF matrix**: term frequency adjusted by the rarity of the term across documents


In [6]:
def tweets_to_dtm(tweets):
    """Build a document-term matrix and save the fitted vectorizer."""
    texts = tweets['cleaned_text']
    vectorizer = CountVectorizer(max_features=2000)
    dtm = vectorizer.fit_transform(texts)
    # A context manager closes the file once the vectorizer is written
    with open('out/dtm.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return dtm, vectorizer

def tweets_to_ngram(tweets, n=2):
    """Build an n-gram count matrix and save the fitted vectorizer."""
    texts = tweets['cleaned_text']
    vectorizer = CountVectorizer(
        ngram_range=(n, n),
        token_pattern=r'\b\w+\b',
        min_df=1,
        max_features=2000)
    ngram = vectorizer.fit_transform(texts)
    with open('out/ngram.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return ngram, vectorizer

def tweets_to_tfidf(tweets):
    """Build a TF-IDF matrix and save the fitted vectorizer."""
    texts = tweets['cleaned_text']
    vectorizer = TfidfVectorizer(max_features=2000)
    tfidf = vectorizer.fit_transform(texts)
    with open('out/tfidf.pk', 'wb') as f:
        pickle.dump(vectorizer, f)
    return tfidf, vectorizer
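
Each function pickles its fitted vectorizer, presumably so that the same vocabulary can be applied to new text later. The sketch below illustrates that reuse on a made-up corpus, round-tripping through `pickle.dumps` and `pickle.loads` in memory instead of a file under `out/`. The corpus and variable names are hypothetical.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

train = ['wooloo cloud', 'woohoo wooloo']
vectorizer = CountVectorizer().fit(train)

# Round-trip through pickle (bytes here; the notebook writes a file instead).
restored = pickle.loads(pickle.dumps(vectorizer))

# transform() reuses the fitted vocabulary, so columns line up with the
# original matrix; words not seen during fitting ('sheep') are ignored.
new_counts = restored.transform(['wooloo wooloo sheep'])
print(sorted(restored.vocabulary_.items()))  # [('cloud', 0), ('woohoo', 1), ('wooloo', 2)]
print(new_counts.toarray())                  # [[0 0 2]]
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.
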

In [7]:
# Get document-term matrix
dtm, dtm_v = tweets_to_dtm(cleaned_tweets)
print('DTM shape:', dtm.shape)
list(dtm_v.vocabulary_.items())[0:5]

DTM shape: (1000, 1334)


[('galar', 435),
 ('region', 906),
 ('new', 736),
 ('pokemon', 835),
 ('dreadnaw', 311)]

In [8]:
# Get ngram matrix
ngram, ngram_v = tweets_to_ngram(cleaned_tweets, n=2)
print('Ngram matrix shape:', ngram.shape)
list(ngram_v.vocabulary_.items())[0:5]

Ngram matrix shape: (1000, 2000)


[('galar region', 572),
 ('region new', 1221),
 ('new pokemon', 959),
 ('pokemon dreadnaw', 1095),
 ('dreadnaw wooloo', 413)]

In [9]:
# Get TFIDF matrix
tfidf, tfidf_v = tweets_to_tfidf(cleaned_tweets)
print('TFIDF matrix shape:', tfidf.shape)
list(tfidf_v.vocabulary_.items())[0:5]

TFIDF matrix shape: (1000, 1334)


[('galar', 435),
 ('region', 906),
 ('new', 736),
 ('pokemon', 835),
 ('dreadnaw', 311)]

### Term frequencies
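We can convert our text matrices back into a list of terms and their accompanying frequencies. For the TFIDF matrix, the `frequency` column holds the summed TFIDF weights of each term rather than raw counts.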

In [10]:
def vector_to_frequency(vector, vectorizer):
    """
    Return a DataFrame of terms and their total occurrence in the corpus,
    sorted from most to least frequent
    """
    total = vector.sum(axis=0)
    frequency = [(w, total[0, i]) for w, i in vectorizer.vocabulary_.items()]
    frequency = pd.DataFrame(frequency, columns=['term', 'frequency'])
    frequency = frequency.sort_values(by='frequency', ascending=False).reset_index(drop=True)
    return frequency
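
As a small worked example of the column-sum idea, the sketch below repeats the same steps on a made-up corpus where the totals are easy to check by hand. The corpus and variable names are hypothetical. `np.asarray(...).ravel()` flattens the column sums into a plain 1-D array, which is an optional convenience and not something the function above requires.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['wooloo cloud', 'woohoo wooloo', 'needed wooloo lives']
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# Summing over axis 0 collapses the documents, leaving one total per term.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index in the matrix.
freq = pd.DataFrame(
    [(term, int(totals[col])) for term, col in vectorizer.vocabulary_.items()],
    columns=['term', 'frequency'],
)
freq = freq.sort_values(by='frequency', ascending=False).reset_index(drop=True)
print(freq.head(1))  # 'wooloo' appears once in each of the three documents
```
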

In [11]:
freq_dtm = vector_to_frequency(dtm, dtm_v)
freq_dtm.to_csv('out/frequency_dtm.csv', index=False)
freq_dtm.head()

Unnamed: 0,term,frequency
0,wooloo,1045
1,outfits,309
2,sheep,198
3,girl,190
4,want,170


In [13]:
freq_ngram = vector_to_frequency(ngram, ngram_v)
freq_ngram.to_csv('out/frequency_ngram.csv', index=False)
freq_ngram.head()

Unnamed: 0,term,frequency
0,outfits sheep,155
1,sheep girl,155
2,girl wooloo,155
3,wooloo gossifleur,155
4,gossifleur fun,155


In [14]:
freq_tfidf = vector_to_frequency(tfidf, tfidf_v)
freq_tfidf.to_csv('out/frequency_tfidf.csv', index=False)
freq_tfidf.head()

Unnamed: 0,term,frequency
0,wooloo,146.323204
1,outfits,94.386514
2,sheep,52.128199
3,girl,51.715751
4,want,48.590284
