# Social Media Mining: Text and Numbers
### Vincent Malic - Spring 2018

## Part I. Representing Text as Numbers
* We will use a dataset made up entirely of categorical features.
* Six instances, three features, categorical text responses.
* We represent it numerically using one-hot encoding.

In [1]:
import pandas as pd

df = pd.DataFrame([
        ["male", "Europe", "Internet Explorer"],
        ["female", "Europe", "Firefox"],
        ["male", "Asia", "Internet Explorer"],
        ["male", "Asia", "Internet Explorer"],
        ["female", "North America", "Chrome"],
        ["female", "North America", "Firefox"]
    ], columns=["gender", "continent", "browser"])

df

|   | gender | continent | browser |
|---|---|---|---|
| 0 | male | Europe | Internet Explorer |
| 1 | female | Europe | Firefox |
| 2 | male | Asia | Internet Explorer |
| 3 | male | Asia | Internet Explorer |
| 4 | female | North America | Chrome |
| 5 | female | North America | Firefox |


## One-hot encoding: Label binarization (dummy coding)
For a given categorical feature (e.g., "continent") with $n$ possible labels:
* Make $n$ new features, one for each label.
* Each new feature takes a value of $1$ if it corresponds to the label of the data point, and $0$ otherwise.
* Use the sklearn tool called ``LabelBinarizer``.

In [2]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()

Use the ``LabelBinarizer`` instance we just created, ``lb``, on the ``continent`` feature to demonstrate one-hot encoding.

In [3]:
new_cont = lb.fit_transform(df['continent'])

In [4]:
new_cont

array([[0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])

### Label binarizer created three new features (columns). 
* Each column corresponds to one of the three possible labels: Asia, Europe, North America.
* In the first row of ``new_cont``, the second column (Europe) has a value of 1, while the other two have values of 0.
* In the third row, the first column (Asia) has a value of 1 and the other two have values of 0.

Whatever your label was in original data, you get a value of 1 in the column corresponding to that label in the one-hot encoding, and a 0 in the other columns. 
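
To make the rule above concrete, here is a minimal sketch that builds the same encoding by hand, without sklearn. The variable names (`continents`, `labels`, `one_hot`) are illustrative and not part of the original notebook; sorting the labels is an assumption chosen so the column order matches what ``LabelBinarizer`` produces.

```python
import numpy as np

# The "continent" column from the example DataFrame.
continents = ["Europe", "Europe", "Asia", "Asia", "North America", "North America"]

# One column per distinct label, in sorted order.
labels = sorted(set(continents))

# For each data point, put a 1 in the column of its label and 0 elsewhere.
one_hot = np.array(
    [[1 if value == label else 0 for label in labels] for value in continents]
)

print(labels)
print(one_hot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.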

### Can use one-hot encoding on a feature with many labels. 
* With the new representation, it may be hard to tell which column corresponds to which label.
* Fortunately, the ``LabelBinarizer`` object (in this case ``lb``) stores that information in its ``classes_`` attribute.

In [5]:
lb.classes_

array(['Asia', 'Europe', 'North America'], 
      dtype='<U13')

## Dummy Variable encoding
* First column represents Asia
* Second column represents Europe
* Third represents North America 
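
Because the binarizer remembers which column belongs to which label, it can also map an encoded array back to the original labels. The sketch below uses ``inverse_transform`` for this; the variable names are illustrative, and the list simply repeats the ``continent`` column so the block runs on its own.

```python
from sklearn.preprocessing import LabelBinarizer

# The "continent" column from the example DataFrame.
continents = ["Europe", "Europe", "Asia", "Asia", "North America", "North America"]

lb = LabelBinarizer()
encoded = lb.fit_transform(continents)

# Recover the original label for each one-hot row.
decoded = lb.inverse_transform(encoded)

print(list(lb.classes_))
print(list(decoded))
```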

### Convert the one-hot encoding variables into a data frame.

In [6]:
new_cont = pd.DataFrame(new_cont, columns=lb.classes_, index=df.index)
new_cont

|   | Asia | Europe | North America |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |


## Join dummy variables to original DataFrame

In [7]:
df = df.join(new_cont)

In [8]:
df

|   | gender | continent | browser | Asia | Europe | North America |
|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 |
| 4 | female | North America | Chrome | 0 | 0 | 1 |
| 5 | female | North America | Firefox | 0 | 0 | 1 |


# Convert Browser feature into one-hot encoding:
* Initialize a new ``LabelBinarizer`` as ``lb``
* Fit it on the ``browser`` column and convert the result to a DataFrame, with column headings from ``lb.classes_``
* Join the new dummy-variable columns to the original DataFrame

In [9]:
lb = LabelBinarizer()
new_browser = lb.fit_transform(df['browser'])
new_browser = pd.DataFrame(new_browser, columns=lb.classes_, index=df.index)
df = df.join(new_browser)
df

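|   | gender | continent | browser | Asia | Europe | North America | Chrome | Firefox | Internet Explorer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | female | North America | Chrome | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | female | North America | Firefox | 0 | 0 | 1 | 0 | 1 | 0 |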
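
The fit, convert, and join steps can also be done in one call with pandas itself. The sketch below uses ``pd.get_dummies``, which is not used in this notebook and is shown only as an alternative; note that it prefixes each new column with the name of the source column, so the headings differ from the ones above.

```python
import pandas as pd

# The two three-label columns from the example DataFrame.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia", "North America", "North America"],
    "browser": ["Internet Explorer", "Firefox", "Internet Explorer",
                "Internet Explorer", "Chrome", "Firefox"],
})

# One-hot encode both columns at once; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df, columns=["continent", "browser"], dtype=int)

print(list(dummies.columns))
print(dummies)
```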


## Create binary labels for Gender:
* Single column with a value of `1` representing `male` and `0` representing `female`.
* By contrast, browser and continent each had three labels, which were converted into three new features.

In [10]:
lb = LabelBinarizer()
new_gender = lb.fit_transform(df['gender'])
new_gender

array([[1],
       [0],
       [1],
       [1],
       [0],
       [0]])

## With two mutually exclusive labels
* Only a single column is needed to store two labels: **male** == **not female**.
* Saying a data point **has label A** is logically equivalent to saying it **does not have label B**.
* The column indicates whether a data point *has* or *does not have* one of the labels.

In [11]:
lb.classes_

array(['female', 'male'], 
      dtype='<U6')
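
A small sketch of why one column is enough: the indicator for the other label is just the complement of the column sklearn returns. The names `is_male` and `is_female` are illustrative, not part of the original notebook.

```python
from sklearn.preprocessing import LabelBinarizer

# The "gender" column from the example DataFrame.
genders = ["male", "female", "male", "male", "female", "female"]

lb = LabelBinarizer()
is_male = lb.fit_transform(genders)   # shape (6, 1): a single column for two labels

# The second label carries no extra information: it is 1 wherever the first is 0.
is_female = 1 - is_male

print(lb.classes_)
print(is_male.ravel())
print(is_female.ravel())
```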

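### Sklearn represents female by 0 and male by 1
* The assignment follows the sorted order of ``classes_``, so it carries no meaning beyond alphabetical order.
* Add this to the original DataFrame.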

In [12]:
new_gender = pd.DataFrame(new_gender, columns=['bgender'], index=df.index)
new_gender

|   | bgender |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| 5 | 0 |


In [13]:
df = df.join(new_gender)

In [14]:
df

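|   | gender | continent | browser | Asia | Europe | North America | Chrome | Firefox | Internet Explorer | bgender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | female | North America | Chrome | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | female | North America | Firefox | 0 | 0 | 1 | 0 | 1 | 0 | 0 |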


## Extract only numerical features to DF
* The categorical features have been converted into numerical features.
* Using pandas subsetting, we can extract only the numerical features.
* This creates a valid dataset to use with a machine learning algorithm (e.g., linear regression).

In [15]:
to_algo = df[["Asia", "Europe", "North America", "Chrome", "Firefox", "Internet Explorer", "bgender"]]
to_algo

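|   | Asia | Europe | North America | Chrome | Firefox | Internet Explorer | bgender |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |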
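
Listing the numerical columns by name works well for a small frame. As a hedged alternative that is not used in the original notebook, pandas can pick them by dtype with ``select_dtypes``. The sketch below uses a shortened, illustrative version of the DataFrame so that it runs on its own.

```python
import pandas as pd

# A cut-down stand-in for the joined DataFrame: two text columns, three numeric ones.
df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "continent": ["Europe", "Europe", "Asia"],
    "Asia": [0, 0, 1],
    "Europe": [1, 1, 0],
    "bgender": [1, 0, 1],
})

# Keep only columns with a numeric dtype; the text columns are dropped.
to_algo = df.select_dtypes(include="number")

print(list(to_algo.columns))
```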


# Representing Raw Text Numerically
* Use the **vector space model** to represent raw text numerically.
* Use ``CountVectorizer`` to make a vector space model representation of a set of texts.

In [16]:
texts = [
"We went to the bank to get some money. Usually, there is a lot of money there. But today, the bank had no money.",
"At the bank, there are a lot of people who work with money. The store money, count money, and try to make more money from their money.",
"We like to take walks along the bank, because the view of the water is beautiful. But today, unfortunately, the water had overran the bank and so we couldn't walk.",
"We drove our boat through the water next to the left bank of the river. There were people standing on the bank waving at us, and some were out in the water swimming."
]

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
cv = CountVectorizer()

In [19]:
vectors = cv.fit_transform(texts)
vectors

<4x59 sparse matrix of type '<class 'numpy.int64'>'
	with 87 stored elements in Compressed Sparse Row format>

### The output is a sparse matrix. 
* Sparse matrices are accepted as input by most sklearn tools, including ``.fit``, ``cross_val_score``, and ``train_test_split``.
* For purposes of instruction only, we convert it into a NumPy array using ``toarray``.
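
As a rough illustration of what "sparse" means here, the sketch below compares the number of stored (non-zero) entries with the full size of the matrix. The two-sentence corpus is a toy example, not the one used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
docs = ["the bank had no money", "the water had overran the bank"]

vectors = CountVectorizer().fit_transform(docs)

n_rows, n_cols = vectors.shape
stored = vectors.nnz                      # entries actually kept in memory
density = stored / (n_rows * n_cols)      # fraction of cells that are non-zero

dense = vectors.toarray()                 # full array, zeros included

print(vectors.shape, stored, round(density, 3))
print(dense)
```

With a real corpus the vocabulary is much larger and each text uses only a small part of it, so the density is far lower than in this toy case.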

In [20]:
vectors = vectors.toarray()

In [21]:
vectors[0, :]

array([0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 3, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 2, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=int64)

Each column of ``vectors`` corresponds to a word in the text corpus, and each row corresponds to a text. The row shown above corresponds to the first text: it has zero instances of word 0, zero instances of word 1, and 2 instances of word 4.

To find out which column corresponds to which word, use the ``.vocabulary_`` attribute of the ``CountVectorizer`` object, a dictionary that maps each word to its column index.

In [22]:
cv.vocabulary_

{'along': 0,
 'and': 1,
 'are': 2,
 'at': 3,
 'bank': 4,
 'beautiful': 5,
 'because': 6,
 'boat': 7,
 'but': 8,
 'couldn': 9,
 'count': 10,
 'drove': 11,
 'from': 12,
 'get': 13,
 'had': 14,
 'in': 15,
 'is': 16,
 'left': 17,
 'like': 18,
 'lot': 19,
 'make': 20,
 'money': 21,
 'more': 22,
 'next': 23,
 'no': 24,
 'of': 25,
 'on': 26,
 'our': 27,
 'out': 28,
 'overran': 29,
 'people': 30,
 'river': 31,
 'so': 32,
 'some': 33,
 'standing': 34,
 'store': 35,
 'swimming': 36,
 'take': 37,
 'the': 38,
 'their': 39,
 'there': 40,
 'through': 41,
 'to': 42,
 'today': 43,
 'try': 44,
 'unfortunately': 45,
 'us': 46,
 'usually': 47,
 'view': 48,
 'walk': 49,
 'walks': 50,
 'water': 51,
 'waving': 52,
 'we': 53,
 'went': 54,
 'were': 55,
 'who': 56,
 'with': 57,
 'work': 58}
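
Since ``vocabulary_`` maps words to column indices, sorting the words by index gives the column headings in order. The sketch below pairs them with the counts in one row; the two-sentence corpus and the names `words_by_column` and `first_text_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
docs = ["We went to the bank to get some money.", "The bank had no money."]

cv = CountVectorizer()
vectors = cv.fit_transform(docs).toarray()

# Words ordered by their column index.
words_by_column = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# Word counts for the first text, skipping words that do not occur in it.
first_text_counts = {
    word: int(count)
    for word, count in zip(words_by_column, vectors[0])
    if count > 0
}

print(first_text_counts)
```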

# Use Count Vectorizer with Twitter data
* Use Tweepy to get the latest 500 tweets from IUB
* Create a list of strings, ``tweet_texts``, holding the text of those 500 tweets
* Use the ``CountVectorizer`` method ``fit_transform`` to represent the tweet text numerically

In [23]:
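import tweepy

# Fill in your own application credentials before running.
API_KEY = ""
API_SECRET = ""

# App-only authentication. wait_on_rate_limit_notify is a Tweepy 3.x argument
# (it was removed in Tweepy 4.0).
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Page through the timeline of the IUBloomington account.
c = tweepy.Cursor(api.user_timeline, id="IUBloomington")

# Collect the text of the 500 most recent tweets.
tweet_texts = []

for tweet in c.items(500):
    tweet_texts.append(tweet.text)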

In [24]:
tweet_texts[:5]

["While other schools offer study abroad, IU is among the nation's top universities for both number of Fulbright Stud… https://t.co/e7IAcOoy54",
 "RT @IndianaUniv: President's Update: New developments in Indiana University's ongoing commitment to improving Hoosier health and well-being…",
 'RT @BTNLiveBIG: Researchers @IUBloomington are saving lives by teaching kids in #Vietnam how to #swim safely. \n\nhttps://t.co/IxNYd7I6WZ htt…',
 'Tonight, be part of an important discussion about the experiences that people of color have with law enforcement --… https://t.co/CkpQvfyJQo',
 "RT @IUNewsroom: IU professor Jeffrey White will discuss his personal experiences of Arctic climate warming for this year's @IUBloomington D…"]

### Resulting matrix has 500 rows with 2738 features
* Each element is the count of a particular word in a particular tweet
* The ``vocabulary_`` attribute is a dictionary of word: column-index pairs
* Convert tweets to vectors and look at word frequency (sparse matrix)

In [25]:
cv = CountVectorizer()
vectors = cv.fit_transform(tweet_texts)

vectors.shape

(500, 2738)

In [26]:
#cv.vocabulary_

In [27]:
tweet0 = vectors[0, :]
print(list(tweet0))

[<1x2738 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>]
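
Printing a sparse row directly only shows its summary. One way to see which words a row contains without converting it to a dense array is to read the stored entries of the CSR structure. This is a sketch on a toy corpus standing in for the tweets; the names `index_to_word` and `row_counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for tweet texts.
docs = ["go hoosiers go", "hoosiers win the game"]

cv = CountVectorizer()
vectors = cv.fit_transform(docs)          # stays sparse (CSR format)

# Reverse lookup: column index -> word.
index_to_word = {index: word for word, index in cv.vocabulary_.items()}

# In CSR format, the entries of row i live in positions indptr[i]:indptr[i + 1].
start, end = vectors.indptr[0], vectors.indptr[1]
row_counts = {
    index_to_word[col]: int(count)
    for col, count in zip(vectors.indices[start:end], vectors.data[start:end])
}

print(row_counts)
```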


## Problems with Count Vectors
* Every word is treated equally: each has its own element holding its frequency count
* Important words ("basketball") are given the same importance as unimportant articles ("the")
* Counts don't account for the length of the text (the proportion of occurrences within the tweet text)

# TF-IDF: Term Frequency-Inverse Document Frequency
* The TF-IDF weight puts more emphasis on words that are characteristic of a document.
* `tfidf(w, d, D) = tf(w, d)*idf(w, D)`

### EXAMPLE:  1000 Documents
* Sample document with 150 instances of *the*, 20 instances of *astrophysics*. 

## Term Frequency (TF)
* `tf(w, d) = number of times w appears in d / total number of words in d`
* The word `the` occurs many times in the document (high TF)
* By contrast, `astrophysics` has a low TF because it is used only rarely

##  Inverse Document Frequency: Proportion
* `idf(w,D) = log (number of documents in D / number of documents in D that have word w)`
* The word `the` also appears in *many other documents* (low IDF)
* `astrophysics` does not appear in that many documents (high IDF)

### Intuition of TF-IDF: 
* The low IDF for `the` dilutes its high TF, resulting in a low overall TF-IDF score
* The high IDF for `astrophysics` overcomes its low TF, resulting in a higher TF-IDF score
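
The intuition can be checked with a little arithmetic using the formulas above. The example gives the corpus size (1000 documents) and the two word counts (150 and 20). The document length (2000 words) and the number of documents containing each word are assumptions made here purely for illustration.

```python
import math

n_documents = 1000                                     # from the example
counts_in_doc = {"the": 150, "astrophysics": 20}       # from the example
doc_length = 2000                                      # assumed: total words in the sample document
docs_containing = {"the": 1000, "astrophysics": 10}    # assumed: document frequencies

def tf(word):
    # Share of the document's words that are this word.
    return counts_in_doc[word] / doc_length

def idf(word):
    # Natural log of (corpus size / number of documents containing the word).
    return math.log(n_documents / docs_containing[word])

def tfidf(word):
    return tf(word) * idf(word)

for word in counts_in_doc:
    print(word, round(tf(word), 4), round(idf(word), 4), round(tfidf(word), 4))
```

Under these assumptions `the` appears in every document, so its IDF is `log(1) = 0` and its TF-IDF is 0 despite the high TF, while `astrophysics` ends up with the larger score.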


## Scikit-Learn class `TfidfVectorizer`
* Once you understand the difference between count vectorizing and TF-IDF vectorizing, using it in sklearn is essentially the same as using ``CountVectorizer``.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
vectors = tv.fit_transform(texts)
vectors = vectors.toarray()
vectors[0,:]

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.24006301,
        0.        ,  0.        ,  0.        ,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.        ,  0.23001526,  0.18134668,
        0.        ,  0.18134668,  0.        ,  0.        ,  0.18134668,
        0.        ,  0.54404003,  0.        ,  0.        ,  0.23001526,
        0.12003151,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.        ,  0.24006301,  0.        ,
        0.29363153,  0.        ,  0.24006301,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.23001526,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.14681576,  0.23001526,
        0.        ,  0.        ,  0.        ,  0.        ])

## Each element in vector corresponds to a word in corpus
* Each vector element represents the *TF-IDF weighting of that word* rather than its frequency in the text
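
The weights sklearn reports do not match the textbook formula exactly. With its default settings, ``TfidfVectorizer`` uses the raw count as the term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. The sketch below reproduces that by hand on a toy corpus (illustrative only) and compares the result with sklearn's output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus for illustration only.
docs = ["the bank had no money", "the bank of the river", "money money money"]

# Raw counts: rows are documents, columns are words.
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1     # smoothed IDF (sklearn default)

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)   # L2-normalize rows

sk = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(manual, sk))
```

This also explains why a word that appears in every text, such as `bank` in the corpus above, still gets a non-zero weight: the `+ 1` keeps its IDF at 1 rather than 0.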