<img src="https://i.imgur.com/6U6q5jQ.png"/>

<a target="_blank" href="https://colab.research.google.com/github/SocialAnalytics-StrategicIntelligence/introTextData/blob/main/index.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Text as Data

Let me use some old tweets from Donald Trump:

In [None]:
import pandas as pd
import os

trumpFile=os.path.join('textData','trumps.csv')
allTweets=pd.read_csv(trumpFile)
allTweets

Let me subset the data frame to keep only the tweets that are not retweets:

In [None]:
# keep only original tweets (is_retweet is assumed to be boolean) and renumber the rows
DTtweets=allTweets[~allTweets.is_retweet].reset_index(drop=True)

## Tokenization

A key step in text analytics is tokenization, where the text is broken into smaller pieces (tokens).
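
As a minimal sketch of the idea, here is a made-up sentence (not taken from the tweets) tokenized in two ways with plain Python. Whitespace splitting keeps punctuation glued to the words, while the regular expression separates it; the pattern is just one reasonable choice, not the rule NLTK applies.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Big rally tonight! See you there: https://example.com"

# Whitespace tokenization: punctuation stays attached to the words
whitespace_tokens = sample.split()

# Regex tokenization: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

Note how the regular expression also breaks the link into many pieces, which is one reason links need special treatment when cleaning.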

We can use:

- NLTK library:

In [None]:
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data to be installed
DTtweets['text'].apply(word_tokenize)

- Pandas string functions:

In [None]:
# raw string, so that \s reaches the regex engine unchanged
DTtweets.text.str.split(r'\s')

Plain pandas seems more convenient. Then, we simply create a series where each cell is a token (word):

In [None]:
import numpy as np

wordInSeries=pd.Series(np.concatenate(DTtweets.text.str.split(r'\s')))
wordInSeries
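
One caveat of splitting on a single whitespace character is that consecutive spaces produce empty tokens. The sketch below uses a made-up string and the standard `re` module (rather than pandas) to show the difference between splitting on `\s` and on `\s+`.

```python
import re

# Hypothetical text with a double space and a line break
sample = "MAKE AMERICA  GREAT\nAGAIN"

single_ws = re.split(r"\s", sample)   # one split per whitespace character
ws_runs = re.split(r"\s+", sample)    # one split per run of whitespace

print(single_ws)
print(ws_runs)
```

The first result contains an empty string between the two spaces, so empty tokens may need to be filtered out later.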

### Cleaning the tokens

In [None]:
# drop the tokens that are links
wordInSeries=wordInSeries[~wordInSeries.str.startswith('http')].reset_index(drop=True)
wordInSeries

In [None]:
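# drop non-ASCII characters (regex=True is needed: recent pandas treats the pattern as literal by default)
wordInSeries=wordInSeries.str.replace('[^\x01-\x7F]','',regex=True)
# replace the HTML entities for '&', '<' and '>'
wordInSeries=wordInSeries.str.replace('&amp;','and',regex=False)
wordInSeries=wordInSeries.str.replace('&lt;|&gt;','',regex=True)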
wordInSeries

In [None]:
# punctuation
import string
PUNCs=string.punctuation # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
wordInSeries=wordInSeries.str.replace('['+PUNCs+']', '',regex=True)

# all to lower case
wordInSeries=wordInSeries.str.lower()
wordInSeries
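
To see what these steps do to individual tokens, here is a small sketch that applies the same sequence to a handful of made-up tokens with the standard library. The helper `clean_token` is hypothetical (it is not part of the notebook), and it escapes the punctuation with `re.escape` as an extra precaution.

```python
import re
import string

def clean_token(token):
    """Apply the cleaning steps to one token (hypothetical helper)."""
    token = re.sub(r"[^\x01-\x7F]", "", token)   # drop non-ASCII characters
    token = token.replace("&amp;", "and")        # HTML entity for '&'
    token = re.sub(r"&lt;|&gt;", "", token)      # HTML entities for '<' and '>'
    # remove punctuation; re.escape keeps characters like ']' and '\' literal
    token = re.sub("[" + re.escape(string.punctuation) + "]", "", token)
    return token.lower()

# Made-up tokens; the last one ends with a non-ASCII character
raw_tokens = ["Thank", "you,", "Ohio!", "Jobs&amp;Growth", "caf\u00e9"]
cleaned = [clean_token(tok) for tok in raw_tokens]
print(cleaned)
```

The order matters: the HTML entities have to be replaced before the punctuation is removed, otherwise `&amp;` would turn into `amp`.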

### Relevant tokens

It is difficult to know what tokens should not be analyzed. Let's count the current ones:

In [None]:
wordInSeries.value_counts()

We could agree that simple syntactic components like determiners, conjunctions, or prepositions do not carry much information. Most of these elements are known as **STOPWORDS**. We use them to reduce our tokens:

In [None]:

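from nltk.corpus import stopwords

# requires the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords')
STOPS = stopwords.words('english')

# drop the stopwords, and the empty tokens left behind by the cleaning steps
wordInSeries=wordInSeries[~wordInSeries.isin(STOPS) & (wordInSeries!='')].reset_index(drop=True)
wordInSeries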
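
The filtering logic can be sketched without NLTK. The stopword list below is a tiny made-up one (NLTK's English list is much longer), and the tokens are invented for illustration.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer
MINI_STOPS = {"the", "a", "is", "of", "and", "to"}

# Made-up tokens, including an empty string left over from cleaning
tokens = ["the", "wall", "is", "a", "great", "wall", ""]

# Keep tokens that are neither stopwords nor empty strings
kept = [tok for tok in tokens if tok and tok not in MINI_STOPS]
print(kept)
```

Using a set for the stopwords makes each membership check fast, which helps when the token list is long.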

## Word Frequency

We can now prepare a frequency table with the remaining words:

In [None]:
wordInSeries.value_counts()

Let's see the distribution of counts:

In [None]:
wordInSeries.value_counts().plot(logy=True, kind='hist')

In [None]:
# keep only the tokens that appear more than 5 times
tokenCounts=wordInSeries.value_counts()
FrequencyTrumpTokens=tokenCounts[tokenCounts>5]
FrequencyTrumpTokens

We have a series; let me turn it into a dict:

In [None]:
FrequencyTrumpTokens.to_dict()
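
The same count, filter and convert sequence can be sketched with `collections.Counter` from the standard library. The tokens and the threshold `MIN_COUNT` are made up for this illustration.

```python
from collections import Counter

# Made-up tokens for illustration
tokens = ["great", "wall", "great", "jobs", "great", "wall"]

counts = Counter(tokens)   # token -> number of occurrences

# Hypothetical threshold: keep tokens seen at least twice
MIN_COUNT = 2
frequent = {word: n for word, n in counts.items() if n >= MIN_COUNT}

print(counts.most_common())
print(frequent)
```

A dictionary of this shape (word to frequency) is what a word cloud built from frequencies expects as input.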

### Plotting

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc1 = WordCloud(background_color='white')
wc1.generate_from_frequencies(frequencies=FrequencyTrumpTokens.to_dict())
plt.figure()
plt.imshow(wc1, interpolation="bilinear")
plt.axis("off")
plt.show()



In [None]:

wc2 = WordCloud(background_color='white',
                colormap="Reds")
wc2.generate_from_frequencies(frequencies=FrequencyTrumpTokens.to_dict())
plt.figure()
plt.imshow(wc2, interpolation="bilinear")
plt.axis("off")
plt.show()


## Bigrams

We can do the same with pairs of words (bigrams). Let me open a text file:

In [None]:
textFile=os.path.join('textData','sometext.txt')
allText=pd.read_table(textFile,header=None)

# see the text
allText

Let's normalize the text by converting it to lowercase and removing punctuation:

In [None]:
allText[0]=allText[0].str.lower()
allText[0]=allText[0].str.replace('['+PUNCs+']', '',regex=True)

Let me create the bigrams:

In [None]:
from nltk import bigrams

theBigrams=[bigrams(eachTW.split()) for eachTW in allText[0]]


# list of all bigrams
from itertools import chain

pairWords = list(chain(*theBigrams))

pairWords
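
What a bigram is can be sketched in plain Python: each word is paired with the word that follows it. The helper `make_bigrams` is hypothetical and only illustrates the idea; the sentence is made up.

```python
def make_bigrams(words):
    """Pair each word with the one that follows it."""
    return list(zip(words, words[1:]))

# Made-up sentence, already lowercased and without punctuation
sentence = "make america great again"
pairs = make_bigrams(sentence.split())
print(pairs)
```

A text with n words yields n-1 bigrams, so a line with a single word produces none.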

I will also use the **stopwords** here. I will get rid of any pair of words that includes at least one of the **stopwords**:

In [None]:
pairWords_clean = [gram for gram in pairWords if not any(stop in gram for stop in STOPS)]
print(pairWords_clean)
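
An equivalent way to express this filter, sketched here with a tiny made-up stopword set and invented bigrams, is to keep a pair only when it shares no element with the stopwords. `set.isdisjoint` does that check directly.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list
MINI_STOPS = {"the", "of", "a"}

# Made-up bigrams for illustration
pairs = [("the", "wall"), ("great", "wall"), ("wall", "of"), ("fake", "news")]

# Keep a pair only if none of its words is a stopword
clean_pairs = [pair for pair in pairs if MINI_STOPS.isdisjoint(pair)]
print(clean_pairs)
```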

At this stage, let me create a frequency table of the bigrams:

In [None]:
from collections import Counter

bigramsCount_dict = Counter(pairWords_clean) #generate counter

# Turn bigramsCount_dict  into dataframe, naming columns
bigramsCount = pd.DataFrame(bigramsCount_dict.most_common(),
                        columns=['theBigram', 'weight'])
bigramsCount

I need to create two columns from the tuples:

In [None]:
bigramsCount['word1'], bigramsCount['word2'] =zip(*bigramsCount['theBigram'])
bigramsCount
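
The `zip(*...)` idiom used above "unzips" a sequence of pairs into one sequence per position. A minimal sketch with made-up bigrams:

```python
# Made-up (bigram, count) records, shaped like the output of Counter.most_common()
counted = [(("great", "wall"), 3), (("fake", "news"), 2)]

# Keep only the bigrams, then split them into first and second words
pairs = [pair for pair, _ in counted]
first_words, second_words = zip(*pairs)

print(first_words)
print(second_words)
```

Each result is a tuple with one entry per bigram, which is why the two results can be assigned to two new columns at once.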

I will use those columns with networkx:

In [None]:
import networkx as nx

G_bigram=nx.from_pandas_edgelist(df=bigramsCount, source='word1',target= 'word2',edge_attr= ["weight"])

In [None]:

# plotting graph (default layout)
nx.draw_networkx(G_bigram)

The default plot is too crowded, so I should keep only the bigrams that appear at least three times:

In [None]:
#subsetting
bigramsCount_wgte_3=bigramsCount[bigramsCount['weight']>=3]

G_bigram_wgte_3=nx.from_pandas_edgelist(df=bigramsCount_wgte_3, source='word1',target= 'word2',edge_attr= ["weight"])

In [None]:

#plotting
fig, ax = plt.subplots(figsize=(10, 10))
pos = nx.spring_layout(G_bigram_wgte_3)

# Plot networks
nx.draw_networkx(G_bigram_wgte_3, pos,
                 edge_color='red',node_color='yellow',
                 node_size=100,with_labels = False,ax=ax)

# labels away from node
for word, coords in pos.items():
    x, y = coords[0]+.05, coords[1]+.03
    ax.text(x, y,s=word,horizontalalignment='center',
            fontsize=13,rotation=30)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
pos = nx.spring_layout(G_bigram_wgte_3, weight='weight',k=0.6)

# edge width proportional to the bigram frequency
edgeWidths=[2*w for _, _, w in G_bigram_wgte_3.edges(data='weight')]
nx.draw_networkx(G_bigram_wgte_3, pos, width=edgeWidths,
                 with_labels=False, ax=ax)

# labels away from node
for word, coords in pos.items():
    x, y = coords[0]+.05, coords[1]+.03
    ax.text(x, y,s=word,horizontalalignment='center',
            fontsize=13,rotation=30)

plt.show()




<div class="alert-success">

<strong>Exercise</strong>
    
1. Create a GitHub repo.
2. Create a notebook in Python, and make a word cloud from a text in English. Use a txt file.
3. Create a notebook in Python, and build the bigrams from the previous txt file.
4. Publish the result as a webpage using GitHub.
    
</div>

<div class="alert alert-danger">
  <strong>CHALLENGE!</strong>
   <br> * Use the function [ngrams](https://tedboy.github.io/nlps/generated/generated/nltk.ngrams.html) from NLTK to build 3-grams and 4-grams. Use a text in Spanish.
</div>