## Text Preprocessing and Classification

In [58]:
from os.path import expanduser
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter

The objective of this notebook is to understand how to prepare text data for a machine learning classification problem. By the end of the notebook, we will have trained a logistic regression model that classifies sentences as subjective or objective.

### Subjectivity dataset
The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data:

In [1]:
def unpack_dataset():
    # Download the archive and extract it into ~/data (created if it does not exist)
    ! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
    ! mkdir -p ~/data
    ! tar -xvf rotten_imdb.tar.gz -C ~/data

In [2]:
unpack_dataset()

--2023-09-12 10:32:07--  http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 519599 (507K) [application/x-gzip]
Saving to: ‘rotten_imdb.tar.gz’


2023-09-12 10:32:10 (1.10 MB/s) - ‘rotten_imdb.tar.gz’ saved [519599/519599]

x quote.tok.gt9.5000
x plot.tok.gt9.5000
x subjdata.README.1.0


In [5]:
# these are subjective sentences
! head -3 ~/data/quote.tok.gt9.5000

smart and alert , thirteen conversations about one thing is a small gem . 
color , musical bounce and warm seas lapping on island shores . and just enough science to send you home thinking . 
it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another . 


In [6]:
# these are objective sentences
! head -3 ~/data/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 
spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . 


### Create a pandas dataframe
Load the data into a pandas dataframe and create the labels. Each sentence is one observation, labeled 1 if it is objective and 0 if it is subjective.

In [7]:
def read_file(path):
    """Read the file at `path` and return its lines as a list of strings."""
    with open(path, encoding="ISO-8859-1") as f:
        # readlines() keeps the trailing '\n' on each line
        content = f.readlines()
    return content

In [11]:
path = expanduser("~/data/quote.tok.gt9.5000")
sub_lines = read_file(path)

In [12]:
path = expanduser("~/data/plot.tok.gt9.5000")
obj_lines = read_file(path)

In [16]:
df0 = pd.DataFrame({"text": sub_lines})
df0["label"] = 0
df0.head()

|   | text | label |
|---|------|-------|
| 0 | smart and alert , thirteen conversations about... | 0 |
| 1 | color , musical bounce and warm seas lapping o... | 0 |
| 2 | it is not a mass-market entertainment but an u... | 0 |
| 3 | a light-hearted french film about the spiritua... | 0 |
| 4 | my wife is an actress has its moments in looki... | 0 |


In [17]:
df1 = pd.DataFrame({"text": obj_lines})
df1["label"] = 1
df1.head()

|   | text | label |
|---|------|-------|
| 0 | the movie begins in the past where a young boy... | 1 |
| 1 | emerging from the human psyche and showing cha... | 1 |
| 2 | spurning her mother's insistence that she get ... | 1 |
| 3 | amitabh can't believe the board of directors a... | 1 |
| 4 | she , among others excentricities , talks to a... | 1 |


In [19]:
# Concatenate the two dataframes vertically (along rows)
df = pd.concat([df0, df1])

# Reset the index, especially if you want a continuous index
df = df.reset_index(drop=True)

In [20]:
df.shape

(10000, 2)

### Tokenizing with spaCy

In [193]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [194]:
def tokenizing(text):
    """Given a string of text, return a list of tokens.
    
    Use spaCy to do the tokenization. Exclude punctuation, stopwords and '\n'.
    """
    # Write code here
    return 

In [92]:
test_string = "joe , and expelled from school as she grows larger with child \n."
test_list = ['joe', 'expelled', 'school', 'grows', 'larger', 'child']

In [93]:
tokenizing(test_string)

['joe', 'expelled', 'school', 'grows', 'larger', 'child']

In [94]:
test = tokenizing(test_string)
for i in range(len(test)):
    assert(test[i] == test_list[i])
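
One possible way to fill in the `tokenizing` stub is sketched below. It relies on spaCy's token flags `is_punct`, `is_stop` and `is_space`. As an assumption made only to keep the sketch self-contained, it uses a blank English pipeline (tokenizer only) instead of `en_core_web_sm`, so no model download is needed; in the notebook, the `nlp` object loaded earlier would be used instead.

```python
import spacy

# Assumption: a tokenizer-only pipeline stands in for spacy.load("en_core_web_sm").
nlp = spacy.blank("en")


def tokenizing(text):
    """Return the tokens of `text`, dropping punctuation, stopwords and whitespace."""
    return [
        token.text
        for token in nlp(text)
        # is_space covers whitespace-only tokens such as '\n'
        if not (token.is_punct or token.is_stop or token.is_space)
    ]


print(tokenizing("joe , and expelled from school as she grows larger with child \n."))
```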

In [95]:
# Create a new column in df with the tokenized lists. Hint: use df.apply()
# Write code here
df["tokens"] = 
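
The hint about `apply` can be illustrated on a toy dataframe. This is only a sketch: `whitespace_tokens` is a hypothetical stand-in for the notebook's `tokenizing` function, and `toy_df` is made-up data.

```python
import pandas as pd


def whitespace_tokens(text):
    """Hypothetical stand-in for the notebook's `tokenizing` function."""
    return text.split()


toy_df = pd.DataFrame({"text": ["a small gem", "the movie begins"], "label": [0, 1]})

# Series.apply calls the function once per value of the "text" column
# and collects the returned lists into a new column.
toy_df["tokens"] = toy_df["text"].apply(whitespace_tokens)
print(toy_df["tokens"].tolist())  # [['a', 'small', 'gem'], ['the', 'movie', 'begins']]
```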

In [96]:
df.head()

|   | text | label | tokens |
|---|------|-------|--------|
| 0 | smart and alert , thirteen conversations about... | 0 | [smart, alert, thirteen, conversations, thing,... |
| 1 | color , musical bounce and warm seas lapping o... | 0 | [color, musical, bounce, warm, seas, lapping, ... |
| 2 | it is not a mass-market entertainment but an u... | 0 | [mass, market, entertainment, uncompromising, ... |
| 3 | a light-hearted french film about the spiritua... | 0 | [light, hearted, french, film, spiritual, ques... |
| 4 | my wife is an actress has its moments in looki... | 0 | [wife, actress, moments, looking, comic, effec... |


In [97]:
# Save df as a pickle file named subjectivity.pickle in ~/data
# Write code here

In [98]:
df = pd.read_pickle("~/data/subjectivity.pickle")
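
The save-and-reload step can be sketched as a round trip through `DataFrame.to_pickle` and `pd.read_pickle`. The toy dataframe and the temporary directory are assumptions made to keep the example self-contained; the notebook itself writes to `~/data/subjectivity.pickle`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the notebook's dataframe.
toy_df = pd.DataFrame(
    {
        "text": ["a small gem .", "the movie begins ."],
        "label": [0, 1],
        "tokens": [["small", "gem"], ["movie", "begins"]],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "subjectivity.pickle"
    toy_df.to_pickle(pickle_path)           # write the dataframe to disk
    restored = pd.read_pickle(pickle_path)  # read it back

# Unlike CSV, pickle keeps the "tokens" column as Python lists.
print(restored["tokens"].tolist())
```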

### Split data into train and test sets

In [99]:
shuffled_df = df.sample(frac=1.0, random_state=42)

# Split the shuffled DataFrame into training and testing sets (e.g., 80% train, 20% test)
train_df, test_df = train_test_split(shuffled_df, test_size=0.2, random_state=42)

In [100]:
len(train_df), len(test_df)

(8000, 2000)

### Feature selection
Compute the frequency of every token across all documents in the training data.

In [101]:
def concatenate_lists(list_of_list):
    """ Given a list of lists, return a single flat list
    """
    # Write code here
    return all_tokens

In [102]:
test_list_of_list = [[1, 2, 3], [4, 5], [6, 7, 8]]
test_actual = concatenate_lists(test_list_of_list)
test_expected = [1, 2, 3, 4, 5, 6, 7, 8]
for i in range(len(test_expected)):
    assert(test_expected[i] == test_actual[i])
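
A minimal sketch of one way to implement `concatenate_lists`, here with `itertools.chain.from_iterable`; a nested list comprehension would work equally well.

```python
from itertools import chain


def concatenate_lists(list_of_list):
    """Flatten a list of lists into a single list, preserving order."""
    return list(chain.from_iterable(list_of_list))


print(concatenate_lists([[1, 2, 3], [4, 5], [6, 7, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```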

In [103]:
def compute_freq(df):
    """ Given a dataframe, return a dictionary mapping each token to its frequency
    """
    # Write code here
    return counts

In [108]:
token_freq = compute_freq(train_df)
assert len(token_freq) == 18705
assert token_freq['film'] == 801
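
A possible sketch of `compute_freq` using `collections.Counter`, which the notebook already imports. It assumes that "frequency" means the total number of occurrences across all documents and that the dataframe has a `tokens` column; `toy_df` is made-up data, so the counts below say nothing about the real corpus.

```python
from collections import Counter

import pandas as pd


def compute_freq(df):
    """Map each token in df["tokens"] to its total number of occurrences."""
    counts = Counter()
    for tokens in df["tokens"]:
        counts.update(tokens)  # add this document's tokens to the running totals
    return counts


toy_df = pd.DataFrame({"tokens": [["film", "story"], ["film", "love"], ["film"]]})
toy_freq = compute_freq(toy_df)
print(toy_freq["film"], len(toy_freq))  # 3 3
```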

In [109]:
def top_K_most_common_tokens(token_freq, K=100):
    """ Returns a list of the K most common tokens in the corpus."""
    # Write code here
    return 

In [113]:
expected = ['film', 'movie', 'story', 'life', 'love', 'like', 'new', 'time', 'world', 'man']
assert top_K_most_common_tokens(token_freq, K=10) == expected
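
One way to sketch `top_K_most_common_tokens` is with `Counter.most_common`, which returns `(token, count)` pairs sorted by decreasing count. The toy counts are made up. Note that tokens with equal counts come back in insertion order, so the ordering among ties depends on how the frequency dictionary was built.

```python
from collections import Counter


def top_K_most_common_tokens(token_freq, K=100):
    """Return the K tokens with the highest counts, most frequent first."""
    return [token for token, _ in Counter(token_freq).most_common(K)]


toy_freq = {"film": 5, "movie": 4, "story": 3, "life": 1}
print(top_K_most_common_tokens(toy_freq, K=3))  # ['film', 'movie', 'story']
```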

In [114]:
top_100 = top_K_most_common_tokens(token_freq)

### Creating bag-of-words features

In [189]:
def bag_of_word_encoding(tokens, keywords=top_100):
    """Create a bag-of-words encoding for an observation.

    Given a list of tokens and a list of keywords, return
    a list of 0s and 1s: position i is 1 if keywords[i] occurs in tokens.
    """
    # Write code here
    return 

In [190]:
test_keywords = ['film', 'movie', 'story']
assert bag_of_word_encoding(['film'], test_keywords) == [1, 0, 0]
assert bag_of_word_encoding(['movie', 'fun'], test_keywords) == [0, 1, 0]
assert bag_of_word_encoding(['happy', 'fun'], test_keywords) == [0, 0, 0]
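
A minimal sketch of a binary bag-of-words encoder that satisfies the assertions above. Unlike the notebook's stub, this version has no default for `keywords`, because `top_100` is not defined in a standalone example.

```python
def bag_of_word_encoding(tokens, keywords):
    """Return one 0/1 entry per keyword: 1 if the keyword occurs in tokens."""
    present = set(tokens)  # set lookup avoids rescanning the token list per keyword
    return [1 if keyword in present else 0 for keyword in keywords]


test_keywords = ["film", "movie", "story"]
print(bag_of_word_encoding(["movie", "fun"], test_keywords))  # [0, 1, 0]
```

The encoding records presence only: a keyword that appears twice in a sentence still yields 1.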

In [177]:
# Add an encoding with the top 100 tokens to the train and test dataframes
# Use the apply function
# Write code here
train_df["encoding"] = 

In [178]:
# Write code here
test_df["encoding"] = 
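
The two cells above can be sketched on toy data as follows. The key point is that the same keyword list, derived from the training data only, encodes both splits. The names `toy_train`, `toy_test` and `toy_keywords` are hypothetical, and the encoder is a simple stand-in for the notebook's `bag_of_word_encoding`.

```python
import pandas as pd


def bag_of_word_encoding(tokens, keywords):
    """Stand-in encoder: one 0/1 entry per keyword."""
    present = set(tokens)
    return [1 if keyword in present else 0 for keyword in keywords]


toy_keywords = ["film", "movie", "story"]  # hypothetical stand-in for top_100
toy_train = pd.DataFrame({"tokens": [["film", "fun"], ["story", "movie"]]})
toy_test = pd.DataFrame({"tokens": [["happy"]]})

# Extra keyword arguments given to Series.apply are forwarded to the function.
toy_train["encoding"] = toy_train["tokens"].apply(bag_of_word_encoding, keywords=toy_keywords)
toy_test["encoding"] = toy_test["tokens"].apply(bag_of_word_encoding, keywords=toy_keywords)

print(toy_train["encoding"].tolist())  # [[1, 0, 0], [0, 1, 1]]
print(toy_test["encoding"].tolist())   # [[0, 0, 0]]
```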

### Creating a baseline model

In [182]:
X_train = train_df["encoding"].values
X_test = test_df["encoding"].values
y_train = train_df["label"].values
y_test = test_df["label"].values

In [186]:
# Convert the list of lists into a 2D np.array
X_train = np.array([np.array(xi) for xi in X_train])
X_train.shape

(8000, 100)

In [188]:
X_test = np.array([np.array(xi) for xi in X_test])
X_test.shape

(2000, 100)
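
The conversion is needed because `.values` on a column of lists yields a 1-D object array whose elements are Python lists, not a numeric matrix. The toy example below sketches the difference; `np.array(series.tolist())` is shown as an equivalent alternative to the list comprehension used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"encoding": [[1, 0, 0], [0, 1, 1]]})

raw = toy["encoding"].values                  # 1-D object array holding Python lists
stacked = np.array(toy["encoding"].tolist())  # 2-D integer array, one row per observation

print(raw.shape, stacked.shape)  # (2,) (2, 3)
```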

In [163]:
from sklearn.linear_model import LogisticRegression

In [187]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

In [192]:
# Mean accuracy on the test set
clf.score(X_test, y_test)

0.7485

Can we improve accuracy by increasing the number of keywords?
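
One way to explore this question is to retrain the model for several values of K and compare test accuracy. The sketch below shows the shape of such an experiment on a tiny made-up corpus; `encode`, `accuracy_for_k` and all the data are hypothetical, and the numbers it prints say nothing about the real dataset. On the notebook's data, the same loop would rebuild the keyword list from `token_freq` for each K and re-encode `train_df` and `test_df`.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus (0 = subjective, 1 = objective); not the real data.
train_tokens = [
    ["smart", "gem"], ["warm", "gem"], ["smart", "funny"],
    ["boy", "hunter"], ["mother", "school"], ["boy", "school"],
]
train_labels = [0, 0, 0, 1, 1, 1]
test_tokens = [["smart", "warm"], ["hunter", "school"]]
test_labels = [0, 1]

# Token counts come from the training split only.
token_freq = Counter(token for tokens in train_tokens for token in tokens)


def encode(token_lists, keywords):
    """Stack one 0/1 bag-of-words row per document into a 2D array."""
    return np.array(
        [[int(keyword in tokens) for keyword in keywords] for tokens in token_lists]
    )


def accuracy_for_k(k):
    """Train on the k most common training tokens and return test accuracy."""
    keywords = [token for token, _ in token_freq.most_common(k)]
    clf = LogisticRegression(random_state=0)
    clf.fit(encode(train_tokens, keywords), train_labels)
    return clf.score(encode(test_tokens, keywords), test_labels)


results = {k: accuracy_for_k(k) for k in (2, 4, 8)}
print(results)
```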