# Text Mining for Economics and Finance HW3 - Group 9

## Andres Brito Barreiro, Hans-Christian Aarnio, Hao Yao, Steven Kingaby

## Question 1

The corpus includes 64,706 reviews of Amazon Digital Music products, from 1997 to 2014. The code below uses the LASSO, Naive Bayesian Classifier, and Multinomial Inverse Regressio models.

In [6]:
# Import the required packages
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk import word_tokenize
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import cross_val_score

In [2]:
# Load the dataset
ratings_data = pd.read_csv('Digital_Music_new.csv')

# Remove unnamed column as it's identical to the
# dataframe's default index. Furthermore reformat `Time`
# column values into datetime objects.
ratings_data = ratings_data.drop('Unnamed: 0', axis = 1)
ratings_data.Time = pd.to_datetime(ratings_data.Time, format = '%m %d, %Y')

ratings_data

Unnamed: 0,Rating,Review,Summary,Time
0,5.0,"It's hard to believe ""Memory of Trees"" came ou...",Enya's last great album,2006-09-12
1,5.0,"A clasically-styled and introverted album, Mem...",Enya at her most elegant,2001-06-03
2,5.0,I never thought Enya would reach the sublime h...,The best so far,2003-07-14
3,5.0,This is the third review of an irish album I w...,Ireland produces good music.,2000-05-03
4,4.0,"Enya, despite being a successful recording art...",4.5; music to dream to,2008-01-17
...,...,...,...,...
64701,4.0,I like the reggae sound a lot in this song. I ...,Cool song,2014-06-24
64702,5.0,I first heard this on Sirius and had to have i...,Great Song,2014-07-09
64703,5.0,"I absolutely love this song, it downloaded fin...",Five Stars,2014-07-13
64704,3.0,"Reggae, island beats aren't really my cup of t...",Well-crafted song,2014-07-09


According to the ratings of these products, we could classify our sample into two sets, highly rated and poorly rated. To do this, we use np.where with the condition that the Rating is greater than or equal to 4 out of 5 to split the observation into two groups, highly rated sample and poorly rated sample. If the product is highly rated, the indicator is 1; otherwise, it's 0.

In [3]:
ratings_data['Highly Rated'] = np.where(ratings_data['Rating'] >= 4.0, 1, 0)
ratings_data

Unnamed: 0,Rating,Review,Summary,Time,Highly Rated
0,5.0,"It's hard to believe ""Memory of Trees"" came ou...",Enya's last great album,2006-09-12,1
1,5.0,"A clasically-styled and introverted album, Mem...",Enya at her most elegant,2001-06-03,1
2,5.0,I never thought Enya would reach the sublime h...,The best so far,2003-07-14,1
3,5.0,This is the third review of an irish album I w...,Ireland produces good music.,2000-05-03,1
4,4.0,"Enya, despite being a successful recording art...",4.5; music to dream to,2008-01-17,1
...,...,...,...,...,...
64701,4.0,I like the reggae sound a lot in this song. I ...,Cool song,2014-06-24,1
64702,5.0,I first heard this on Sirius and had to have i...,Great Song,2014-07-09,1
64703,5.0,"I absolutely love this song, it downloaded fin...",Five Stars,2014-07-13,1
64704,3.0,"Reggae, island beats aren't really my cup of t...",Well-crafted song,2014-07-09,0


## Question 2

Now we continue with the usual steps of preprocessing the reviews and summary columns. First, we write a self-defined function on text preprocessing. Then, we need to collect the documents as corpus.

In [4]:
stopwords = set(stopwords.words("english"))
stemming = PorterStemmer()

# Before tokenization it is useful to remove all words of length 1 and to make everything lowercase 
# and to remove words of length one.
def remove_one_letter_words(document):
    document.values[0] = re.sub(r'\b[a-zA-Z]\b', '', document.values[0]) 

    return document

def clean_tokens(row, column_name):
    documents = row[column_name] # converts the documents (i.e. report) column of BoE_data to row
    tokens = word_tokenize(documents) # uses the word_tokenize function to tokenize each 'row', i.e. report
    alpha_tokens = [w for w in tokens if w.isalpha()] # removes all non-alphabetic tokens
    return alpha_tokens

def remove_stopwords(row, column_name):
    tokens = row[column_name] # converts the tokens column of new BoE_data to row
    useful_words = [w for w in tokens if not w in stopwords] # takes only the words that are not in the stopwords set
    return useful_words

def create_stems(row, column_name):
    stopwords_list = row[column_name] # converts the stopwords column of new BoE_data to row
    stemmed_list = [stemming.stem(word) for word in stopwords_list] # stems each word in each BoE_data row
    return stemmed_list

# Creating clean data and turning our pre processed data into single strings for each report.
def rejoin_words(row):
    my_list = row['stems']
    joined_words = ( " ".join(my_list))
    return joined_words

def preprocess(raw_data):
    raw_data = raw_data.apply(lambda document: document.astype(str).str.lower(), axis=1) # making everything lowercase
    raw_data = raw_data.apply(lambda document: remove_one_letter_words(document), axis=1) # remove one letter words
    raw_data['tokens'] = raw_data.apply(lambda x : clean_tokens(x, 'Review'), axis=1)
    raw_data['stopwords'] = raw_data.apply(lambda x : remove_stopwords(x, 'tokens'), axis=1)
    raw_data['stems'] = raw_data.apply(lambda x : create_stems(x, 'stopwords'), axis=1)
    raw_data['Preprocessed Review'] = raw_data.apply(rejoin_words, axis=1)

    # Clean up and remove columns that aren't needed in the long run.
    raw_data = raw_data.drop(['tokens', 'stopwords', 'stems'], axis = 1)

    return raw_data

ratings_data = preprocess(ratings_data)
ratings_data

Unnamed: 0,Rating,Review,Summary,Time,Highly Rated,Preprocessed Review
0,5.0,"it's hard to believe ""memory of trees"" came ou...",enya's last great album,2006-09-12 00:00:00,1,hard believ memori tree came year ago held wel...
1,5.0,"a clasically-styled and introverted album, mem...",enya at her most elegant,2001-06-03 00:00:00,1,introvert album memori tree masterpiec subtlet...
2,5.0,i never thought enya would reach the sublime h...,the best so far,2003-07-14 00:00:00,1,never thought enya would reach sublim height e...
3,5.0,this is the third review of an irish album i w...,ireland produces good music.,2000-05-03 00:00:00,1,third review irish album write today other cra...
4,4.0,"enya, despite being a successful recording art...",4.5; music to dream to,2008-01-17 00:00:00,1,enya despit success record artist broad appeal...
...,...,...,...,...,...,...
64701,4.0,i like the reggae sound a lot in this song. i ...,cool song,2014-06-24 00:00:00,1,like regga sound lot song heard radio realli l...
64702,5.0,i first heard this on sirius and had to have i...,great song,2014-07-09 00:00:00,1,first heard siriu fun song ca help guy ask dad...
64703,5.0,"i absolutely love this song, it downloaded fin...",five stars,2014-07-13 00:00:00,1,absolut love song download fine would recommen...
64704,3.0,"reggae, island beats aren't really my cup of t...",well-crafted song,2014-07-09 00:00:00,0,regga island beat realli cup team particularli...


In [7]:
from nltk.corpus import stopwords

# Initialize count vectoriser and repurpose it for use in it for the 
count_vectorizer = CountVectorizer(min_df = 0.0001, stop_words=stopwords.words('english'))
document_term_matrix = count_vectorizer.fit_transform(ratings_data['Preprocessed Review'])
document_term_matrix = document_term_matrix.toarray()
is_highly_rated = ratings_data['Highly Rated'].values

print(document_term_matrix.shape)
print(is_highly_rated)

(64706, 17890)
['1' '1' '1' ... '1' '0' '0']


In [8]:
# Rename variables to something more meaningful to
# the context of model fitting, predicting, etc.
X_set = document_term_matrix[10:]
Y_set = is_highly_rated[10:]

print(X_set.shape)
print(Y_set.shape)

# Set up training and test sets.
# X_train, X_test, Y_train, Y_test = train_test_split(X_set, Y_set, test_size = 0.1, random_state = 1)

(64696, 17890)
(64696,)


In [10]:
print(np.arange(0.001,2,0.01))


[1.000e-03 1.100e-02 2.100e-02 3.100e-02 4.100e-02 5.100e-02 6.100e-02
 7.100e-02 8.100e-02 9.100e-02 1.010e-01 1.110e-01 1.210e-01 1.310e-01
 1.410e-01 1.510e-01 1.610e-01 1.710e-01 1.810e-01 1.910e-01 2.010e-01
 2.110e-01 2.210e-01 2.310e-01 2.410e-01 2.510e-01 2.610e-01 2.710e-01
 2.810e-01 2.910e-01 3.010e-01 3.110e-01 3.210e-01 3.310e-01 3.410e-01
 3.510e-01 3.610e-01 3.710e-01 3.810e-01 3.910e-01 4.010e-01 4.110e-01
 4.210e-01 4.310e-01 4.410e-01 4.510e-01 4.610e-01 4.710e-01 4.810e-01
 4.910e-01 5.010e-01 5.110e-01 5.210e-01 5.310e-01 5.410e-01 5.510e-01
 5.610e-01 5.710e-01 5.810e-01 5.910e-01 6.010e-01 6.110e-01 6.210e-01
 6.310e-01 6.410e-01 6.510e-01 6.610e-01 6.710e-01 6.810e-01 6.910e-01
 7.010e-01 7.110e-01 7.210e-01 7.310e-01 7.410e-01 7.510e-01 7.610e-01
 7.710e-01 7.810e-01 7.910e-01 8.010e-01 8.110e-01 8.210e-01 8.310e-01
 8.410e-01 8.510e-01 8.610e-01 8.710e-01 8.810e-01 8.910e-01 9.010e-01
 9.110e-01 9.210e-01 9.310e-01 9.410e-01 9.510e-01 9.610e-01 9.710e-01
 9.810

## Question 3

In [49]:
naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(X_train, Y_train)
Y_predicted = naive_bayes_model.predict(X_test) # TODO: Is this even needed?

KeyboardInterrupt: 

TODO: "Devise a metric that identifies the terms most associated with each class label. How do these terms compare to those you identified in the previous equation?"

## Question 4

In [None]:
print('TODO')

## Question 5

In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(X_set, Y_set, test_size = 0.1, random_state = 1)

print(X_train.shape)
print(X_test.shape)

(58235, 17890)
(6471, 17890)
