In [40]:
import pandas as pd
import os
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gjber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gjber\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gjber\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
positive_path = 'aclImdb/train/pos/'
negative_path = 'aclImdb/train/neg/'

In [24]:
print("Number of texts in positive",len(os.listdir(positive_path)))
print("Number of texts in negative",len(os.listdir(negative_path)))

Number of texts in positive 12500
Number of texts in negative 12500


Now for the actual feature engineering. There are several options:

1. bag of words
2. word2vec
3. n-grams (e.g. trigrams)
4. etc.
 
However, the overarching goal of this project is to classify the sentiment of text using linear classifiers.
A bag-of-words approach should therefore catch many important predictors (words like good, bad, love, hate). The drawback is that it creates one variable per distinct word, which severely inflates the number of variables in the model.

Possible workarounds are stemming (or lemmatizing) words, removing stop words (these usually capture style rather than sentiment), and keeping a word as a predictor only if it falls in, say, the top quartile of word frequencies.

More succinct and creative ways of capturing sentiment may also be worth pursuing.
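
As a rough illustration of the bag-of-words idea combined with frequency pruning, here is a minimal sketch using only the standard library. The helper names `build_vocabulary` and `bag_of_words`, the `keep_fraction` parameter, and the toy documents are assumptions made for this example; they are not part of the notebook's pipeline, and the cut-off is applied to distinct words ranked by count, which is only one possible reading of "top quartile".

```python
from collections import Counter

def build_vocabulary(tokenized_docs, keep_fraction=0.25):
    """Keep the most frequent `keep_fraction` of the distinct tokens."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    n_keep = max(1, int(len(counts) * keep_fraction))
    return [token for token, _ in counts.most_common(n_keep)]

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

# Toy tokenized documents (hypothetical data, not taken from aclImdb)
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie", "bad", "acting"],
    ["good", "acting", "bad", "plot"],
]

vocabulary = build_vocabulary(docs, keep_fraction=0.5)
features = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)
print(features)
```

Each row of `features` is one document and each column is one retained word, so the pruning step directly controls how many variables the linear classifier sees.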

## Read in data

In [36]:
# Read the text files from the train folders
pos_train_txt = []
pos_train_label = []

for file_name in os.listdir(positive_path):
    # `with` closes each file as soon as it has been read
    with open(positive_path + file_name, encoding='utf-8') as f:
        pos_train_txt.append(f.read())
    pos_train_label.append('pos')

neg_train_txt = []
neg_train_label = []
for file_name in os.listdir(negative_path):
    with open(negative_path + file_name, encoding='utf-8') as f:
        neg_train_txt.append(f.read())
    neg_train_label.append('neg')

In [45]:
# Create a pandas dataframe from the text
train_pos = pd.DataFrame({'text':pos_train_txt,'label':pos_train_label})
train_neg = pd.DataFrame({'text':neg_train_txt,'label':neg_train_label})
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train = pd.concat([train_pos, train_neg])
train.head()

Unnamed: 0,text,label
0,Bromwell High is a cartoon comedy. It ran at t...,pos
1,Homelessness (or Houselessness as George Carli...,pos
2,Brilliant over-acting by Lesley Ann Warren. Be...,pos
3,This is easily the most underrated film inn th...,pos
4,This is not the typical Mel Brooks film. It wa...,pos


## Let's begin by removing stop words and lemmatizing the rest.

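- Lemmatizing is chosen over stemming because it maps inflected forms to dictionary words, so the resulting features stay readable. Stemming is usually the more aggressive of the two and may shrink the vocabulary further, so the feature count is worth checking for both.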
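
One way to check how much a given normalization shrinks the feature set is to count distinct tokens before and after applying it. The sketch below is only illustrative: `vocabulary_size` and `strip_plural_s` are hypothetical helpers, and `strip_plural_s` is a toy stand-in for a real stemmer or lemmatizer (for example, the `lemmatize` method of the `WordNetLemmatizer` imported above could be passed in its place).

```python
def vocabulary_size(tokenized_docs, normalize=lambda token: token):
    """Number of distinct tokens after applying `normalize` to each one."""
    return len({normalize(token) for doc in tokenized_docs for token in doc})

def strip_plural_s(token):
    # Toy normalizer (hypothetical): drops a trailing "s" from longer words.
    # It stands in for a real stemmer or lemmatizer in this sketch.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

# Toy tokenized documents (hypothetical data)
docs = [["movies", "movie", "actors"], ["actor", "plots", "plot"]]

raw_size = vocabulary_size(docs)
reduced_size = vocabulary_size(docs, strip_plural_s)
print(raw_size, reduced_size)
```

Running the same comparison with a stemmer and with the lemmatizer on the tokenized reviews would show which of the two actually yields fewer features for this corpus.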

In [46]:
# Word tokenize first
train.text = train.text.apply(lambda x: nltk.word_tokenize(x))

In [47]:
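# Remove stop words. Build the English set once: calling stopwords.words() for every
# token rebuilds the list for all languages and scans it linearly, which is why the
# first attempt had to be interrupted. Tokens are lowercased so 'The' is caught too.
stop_words = set(stopwords.words('english'))
train.text = train.text.apply(lambda x: [word for word in x if word.lower() not in stop_words])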

In [None]:
# Lemmatize the remaining words
lemmatizer = WordNetLemmatizer()
train.text = train.text.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])