# Tuning Count Vectorization - One Hot Encoding and other Features

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
plots_df = pd.read_csv("movie_plots.csv")

# filter only for American movies
plots_df = plots_df[plots_df["Origin/Ethnicity"] == "American"]

# traditional CountVectorizer
vectorizer = CountVectorizer()

# use English stopwords, and use one-hot encoding
vectorizer = CountVectorizer(stop_words="english", binary=True)

# use English stopwords, and use one-hot encoding; a float min_df is a proportion,
# so the word must appear in at least 5% of the movie plots
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=0.05)

# use English stopwords, and use one-hot encoding, and the word must appear in at least two of the movie plots
# and keep only the top 200
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=2, max_features=200) 

X = vectorizer.fit_transform(plots_df["Plot"])

vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
# with binary=True every cell is 0 or 1, so this sum counts (plot, word) pairs
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

Shape of dataframe is (655, 200)
Total number of occurrences: 29652


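| | able | accidentally | agrees | appears | arrive | arrives | asks | attack | attacks | attempt | ... | way | wife | woman | work | working | world | year | years | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |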
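
To make the effect of `binary=True` and an integer `min_df` concrete, here is a minimal pure-Python sketch. It is not how scikit-learn is implemented: the helper `one_hot_rows`, the toy plots, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter

def one_hot_rows(documents, min_df=1):
    """Build a 0/1 document-term table, keeping terms found in >= min_df documents."""
    # A set per document: repeated words collapse, which mirrors one-hot encoding.
    tokenized = [set(doc.lower().split()) for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(token for tokens in tokenized for token in tokens)
    vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)
    rows = [[int(term in tokens) for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows

# Hypothetical toy corpus standing in for the movie plots.
toy_plots = [
    "detective hunts thief",
    "thief robs bank thief escapes",
    "detective retires early",
]
vocabulary, rows = one_hot_rows(toy_plots, min_df=2)
print(vocabulary)
print(rows)
```

Only `detective` and `thief` appear in at least two plots, so they are the only columns kept, and the repeated `thief` in the second plot still produces a single 1.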


# Cosine Similarity Example

### Intro to Algorithmic Marketing:
![alt text](images/cos-sim-textbook1.png "Logo Title Text 1")


## Finding Magnitude of a Vector

In [22]:
import math
import numpy as np
def magnitude(x): 
    return math.sqrt(sum(i**2 for i in x))

vectorA = [0,3,1,2]

print(f"First approach: {magnitude(vectorA)}")
print(f"Second approach: {np.linalg.norm(vectorA)}")

First approach: 3.7416573867739413
Second approach: 3.7416573867739413
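
With the magnitude in hand, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. The sketch below uses only the standard library; `cosine_similarity`, `vectorB`, and `vectorC` are names and values introduced here for illustration.

```python
import math

def magnitude(x):
    return math.sqrt(sum(i ** 2 for i in x))

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot_product = sum(i * j for i, j in zip(a, b))
    return dot_product / (magnitude(a) * magnitude(b))

vectorA = [0, 3, 1, 2]
vectorB = [0, 6, 2, 4]  # same direction as vectorA, twice as long
vectorC = [5, 0, 0, 0]  # shares no non-zero dimension with vectorA

sim_ab = cosine_similarity(vectorA, vectorB)
sim_ac = cosine_similarity(vectorA, vectorC)
print(f"A vs B: {sim_ab}")
print(f"A vs C: {sim_ac}")
```

Vectors that point in the same direction score close to 1 regardless of length, while vectors with no overlapping non-zero dimensions score 0.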


# Pointwise Mutual Information

It's important to identify a **context window** when analyzing co-occurrence. In the image below, the context window size is 4 (2 tokens to either side of the target word):

![alt text](images/context_window.png "Logo Title Text 1")
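
A windowed context like this can be extracted with simple list slicing. This is a minimal sketch: the helper name `context_windows`, its `radius` parameter, and the example sentence are assumptions for illustration.

```python
def context_windows(tokens, radius=2):
    """Pair each token with up to `radius` neighbours on either side."""
    windows = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - radius):i]
        right = tokens[i + 1:i + 1 + radius]
        windows.append((target, left + right))
    return windows

tokens = "the quick brown fox jumps over the lazy dog".split()
windows = context_windows(tokens, radius=2)
for target, context in windows[:4]:
    print(target, context)
```

Tokens near the start or end of the sentence simply get a shorter context, since there are fewer neighbours to take.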

For the purposes of the next section, we'll define the **entire document as the context window.**

Pointwise mutual information compares the **joint probability of two events happening** with the probability we would expect if the two events were independent. It is defined as the logarithm of the ratio of the two:

$$
\mathrm{PMI}_{A,B} = \log \frac{p(A,B)}{p(A)\,p(B)}
$$

Remember that when two events are independent, $p(A,B) = p(A)p(B)$, so the ratio is 1 and the PMI is 0. PMI is often preferable to a raw co-occurrence count because raw counts are heavily skewed toward very common words ("the" and "of" will co-occur frequently in the same document).

```python
import math

def pmi(tokenA, tokenB, documents, word_counts, bigram_freq):
    # word_counts[tokenA] => number of documents that contain tokenA
    # len(documents) => number of documents
    # bigram_freq => a dictionary keyed by "tokenA tokenB" holding the number of
    #                documents in which both tokens appear together
    n_documents = float(len(documents))
    prob_A = word_counts[tokenA] / n_documents
    prob_B = word_counts[tokenB] / n_documents
    prob_A_B = bigram_freq[" ".join([tokenA, tokenB])] / n_documents
    # base-2 logarithm of the ratio of observed to expected co-occurrence
    return math.log(prob_A_B / (prob_A * prob_B), 2)
```
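
For a self-contained illustration, the sketch below computes document-level PMI directly from tokenized documents instead of precomputed count dictionaries. The function `document_pmi` and the toy corpus are assumptions for illustration, and returning negative infinity for pairs that never co-occur is one possible convention, not the only one.

```python
import math

def document_pmi(token_a, token_b, documents):
    """PMI with the whole document as the context window (base-2 logarithm)."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    count_a = sum(token_a in d for d in doc_sets)
    count_b = sum(token_b in d for d in doc_sets)
    count_ab = sum(token_a in d and token_b in d for d in doc_sets)
    if count_ab == 0:
        # The pair never co-occurs; log(0) is undefined, so use -inf by convention.
        return float("-inf")
    prob_a = count_a / n_docs
    prob_b = count_b / n_docs
    prob_ab = count_ab / n_docs
    return math.log2(prob_ab / (prob_a * prob_b))

# Hypothetical toy corpus: each inner list is one tokenized document.
toy_documents = [
    ["new", "york", "city"],
    ["new", "york", "pizza"],
    ["new", "car"],
    ["old", "car"],
]
pmi_new_york = document_pmi("new", "york", toy_documents)
pmi_new_car = document_pmi("new", "car", toy_documents)
print(f"PMI(new, york) = {pmi_new_york:.3f}")
print(f"PMI(new, car)  = {pmi_new_car:.3f}")
```

Here `new` and `york` appear together more often than independence would predict, so their PMI is positive, while `new` and `car` appear together less often than expected, so their PMI is negative.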

# Collocation

In previous homework assignments, we often had to find phrases that belong together by hand, for example `New York City`.

From [nltk.org](http://www.nltk.org/howto/collocations.html), **collocation** can be defined as

> expressions of multiple words which commonly co-occur together. 

In [3]:
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# note: this rebinds `stopwords` from the corpus reader to a set of English stopwords plus punctuation tokens
stopwords = set(stopwords.words('english') + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-"])

In [4]:
documents = []
articles = [f"bbcsport/football/00{i}.txt" for i in range(1,10)]

for path in articles:
    with open(path) as article:  # open each sports article and close it when done
        for line in article:
            line = line.replace("\n", "")  # strip the newline character
            if len(line) > 0:  # if the line is not empty, process it
                line = [lemmatizer.lemmatize(token) for token in word_tokenize(line)]
                documents.append(line)

In [5]:
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [6]:
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 15)

[('Champions', 'League'),
 ('Manchester', 'United'),
 ('Cristiano', 'Ronaldo'),
 ('Van', 'Nistelrooy'),
 ('Wayne', 'Rooney'),
 ('Alex', 'Ferguson'),
 ('FA', 'Cup'),
 ('Ferguson', 'wa'),
 ('Gary', 'Neville'),
 ('Man', 'Utd'),
 ('Manchester', 'City'),
 ('Sir', 'Alex'),
 ('national', 'team'),
 ('wa', "n't"),
 ('23', 'minute')]
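
Ranking by raw frequency amounts to counting adjacent token pairs. The following is a minimal standard-library sketch of that idea, not NLTK's implementation; `top_bigrams` and the toy documents are assumptions for illustration.

```python
from collections import Counter

def top_bigrams(tokenized_documents, n=3):
    """Return the n most frequent adjacent token pairs across all documents."""
    counts = Counter()
    for tokens in tokenized_documents:
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens
        counts.update(zip(tokens, tokens[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]

# Hypothetical tokenized documents.
toy_documents = [
    ["Manchester", "United", "beat", "Manchester", "City"],
    ["Manchester", "United", "won", "FA", "Cup"],
    ["FA", "Cup", "final"],
]
best = top_bigrams(toy_documents, n=2)
print(best)
```

Pairs are counted only within a document, so the last token of one document is never paired with the first token of the next.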

# Term Frequency / Inverse Document Frequency


## Term Frequency
![alt text](images/tf-idf1.png "Term Frequency")

## Inverse Document Frequency
![alt text](images/tf-idf2.png "Inverse Document Frequency")

### Example Calculation

![alt text](images/tf-idf4.png "Example")
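
The calculation can also be sketched in a few lines of Python. This sketch assumes one common textbook form, raw term count multiplied by $\log(N / df)$, which may differ from the exact variant shown in the figures above; the function name `tf_idf` and the toy corpus are likewise assumptions.

```python
import math

def tf_idf(term, document, corpus):
    """Raw term count in `document` times log(N / df) over `corpus` (assumed variant)."""
    tf = document.count(term)
    df = sum(term in doc for doc in corpus)
    if df == 0:
        return 0.0  # the term appears nowhere, so there is nothing to weight
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical tokenized reviews.
corpus = [
    ["worst", "burger", "ever"],
    ["cold", "fries", "cold", "burger"],
    ["slow", "drive", "thru"],
]
score_cold = tf_idf("cold", corpus[1], corpus)
score_burger = tf_idf("burger", corpus[0], corpus)
print(f"cold:   {score_cold:.3f}")
print(f"burger: {score_burger:.3f}")
```

`cold` scores higher than `burger` because it occurs twice in its review and appears in only one of the three reviews.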

## Using Scikit-Learn to Generate TF-IDF

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

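# ngram_range=(3,4): build features from 3- and 4-word phrases
# token_pattern: keep only alphabetic tokens of at least three letters
# max_df=0.4: drop phrases that appear in more than 40% of the reviews
# stopwords.words() with no language argument returns the stopwords of every language in the NLTK corpus
vectorizer = TfidfVectorizer(ngram_range=(3,4),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.4, stop_words=stopwords.words())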

In [2]:
df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")
corpus = list(df["review"].values)

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)


In [3]:
tf_idf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1515,1516,1517,1518,1519,1520,1521,1522,1523,1524
aaaaaaaahhhhhhhhhhh still feel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaaaaaaahhhhhhhhhhh still feel situation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abbreviated menu worthy,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abbreviated menu worthy mcdonald,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abc kitchen numerous,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zombies bikes stopped stare,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zombies little less,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zombies little less predictable,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zoom line person,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)

In [5]:
score

| | score |
|---|---|
| worst mcdonald ever | 2.602793 |
| get order right | 2.390093 |
| worst mcdonalds ever | 2.342385 |
| went drive thru | 1.754710 |
| drive thru window | 1.624240 |
| ... | ... |
| need let peeps yelper | 0.033990 |
| part mcwrap could | 0.033990 |
| much stinky yes stinky | 0.033990 |
| much stinky yes | 0.033990 |


In [6]:
score.to_csv("scores.csv")

## Exercises

For the following exercises, use the definitions below:

**Term frequency**:
$$
tf = n(t,d)
$$
**Inverse document frequency**:
$$
idf = 1 + \frac{N}{df(t) + 1}
$$

Here $n(t,d)$ is the number of times term $t$ appears in document $d$, $N$ is the number of documents, and $df(t)$ is the number of documents that contain $t$.

For the following exercise, you can perform `lemmatization` and remove `to`, `and`, and `the` as stopwords.

In [1]:
documents = [
    "He ate the food",
    "He liked the meal",
    "She likes the food",
    "They like to eat and eat"
]

### Calculate the TF-IDF score for `eat` in each of the documents

### In the corpus below, are `fruit` and `love` independent in terms of their appearance in documents? Prove why or why not.

In [2]:
corpus = [
    "I love fruit",
    "She eats fruit",
    "They love fruit",
    "I love the toy",
    "She ate dinner"
]

### True/False (Explain Why If False)

#### Two words `A` and `B` must be independent if they never appear in the same document together.

#### The probability of seeing the word `car` in a document is always equal to or larger than the probability of seeing the word `car` given any other word.