# Text Mining for Economics and Finance HW3 - Group 9

## Andres Brito Barreiro, Hans-Christian Aarnio, Hao Yao, Steven Kingaby

## Question 1

The corpus contains 64,706 reviews of Amazon Digital Music products, written between 1997 and 2014. The code below uses the LASSO, Naive Bayes classifier, and Multinomial Inverse Regression models.
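As a rough orientation for two of the models named above, here is a minimal sketch on a hypothetical toy corpus. It assumes the LASSO is implemented as an L1-penalised logistic regression on bag-of-words counts, which is one common choice and not necessarily the specification used in this homework. The names `toy_reviews`, `toy_labels`, `lasso_clf` and `nb_clf`, as well as the value of `C`, are illustrative. Multinomial inverse regression is left out because the sketch only relies on scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; 1 = highly rated, 0 = poorly rated.
toy_reviews = [
    "great album love every song",
    "love this great song",
    "wonderful voice great music",
    "terrible album boring songs",
    "boring music awful voice",
    "awful terrible waste of money",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts: one row per review, one column per term.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(toy_reviews)

# LASSO-type classifier (assumption): logistic regression with an L1 penalty.
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
lasso_clf.fit(X, toy_labels)

# Multinomial Naive Bayes fitted on the same count matrix.
nb_clf = MultinomialNB()
nb_clf.fit(X, toy_labels)

# Classify an unseen review with both models.
new_review = vectorizer.transform(["great song love it"])
lasso_pred = lasso_clf.predict(new_review)
nb_pred = nb_clf.predict(new_review)
print(lasso_pred, nb_pred)
```

On the full dataset the same calls would take the preprocessed review text and the `Indicator` column in place of the toy lists, ideally after a train/test split.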

In [9]:
# Import the required packages
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import cross_val_score
import nltk

In [5]:
# Load the dataset
ratings = pd.read_csv('Digital_Music_new.csv')

# Remove unnamed column as it's identical to the
# dataframe's default index. Furthermore reformat `Time`
# column values into datetime objects.
ratings = ratings.drop('Unnamed: 0', axis = 1)
ratings.Time = pd.to_datetime(ratings.Time, format = '%m %d, %Y')

ratings

| | Rating | Review | Summary | Time |
|---|---|---|---|---|
| 0 | 5.0 | It's hard to believe "Memory of Trees" came ou... | Enya's last great album | 2006-09-12 |
| 1 | 5.0 | A clasically-styled and introverted album, Mem... | Enya at her most elegant | 2001-06-03 |
| 2 | 5.0 | I never thought Enya would reach the sublime h... | The best so far | 2003-07-14 |
| 3 | 5.0 | This is the third review of an irish album I w... | Ireland produces good music. | 2000-05-03 |
| 4 | 4.0 | Enya, despite being a successful recording art... | 4.5; music to dream to | 2008-01-17 |
| ... | ... | ... | ... | ... |
| 64701 | 4.0 | I like the reggae sound a lot in this song. I ... | Cool song | 2014-06-24 |
| 64702 | 5.0 | I first heard this on Sirius and had to have i... | Great Song | 2014-07-09 |
| 64703 | 5.0 | I absolutely love this song, it downloaded fin... | Five Stars | 2014-07-13 |
| 64704 | 3.0 | Reggae, island beats aren't really my cup of t... | Well-crafted song | 2014-07-09 |


Based on the ratings, we classify the sample into two groups: highly rated and poorly rated. To do this, we use `np.where` with the condition that `Rating` is greater than or equal to 4 (out of 5). The indicator is 1 for a highly rated observation and 0 otherwise.

In [6]:
ratings['Indicator'] = np.where(ratings['Rating'] >= 4.0, 1, 0)
ratings

| | Rating | Review | Summary | Time | Indicator |
|---|---|---|---|---|---|
| 0 | 5.0 | It's hard to believe "Memory of Trees" came ou... | Enya's last great album | 2006-09-12 | 1 |
| 1 | 5.0 | A clasically-styled and introverted album, Mem... | Enya at her most elegant | 2001-06-03 | 1 |
| 2 | 5.0 | I never thought Enya would reach the sublime h... | The best so far | 2003-07-14 | 1 |
| 3 | 5.0 | This is the third review of an irish album I w... | Ireland produces good music. | 2000-05-03 | 1 |
| 4 | 4.0 | Enya, despite being a successful recording art... | 4.5; music to dream to | 2008-01-17 | 1 |
| ... | ... | ... | ... | ... | ... |
| 64701 | 4.0 | I like the reggae sound a lot in this song. I ... | Cool song | 2014-06-24 | 1 |
| 64702 | 5.0 | I first heard this on Sirius and had to have i... | Great Song | 2014-07-09 | 1 |
| 64703 | 5.0 | I absolutely love this song, it downloaded fin... | Five Stars | 2014-07-13 | 1 |
| 64704 | 3.0 | Reggae, island beats aren't really my cup of t... | Well-crafted song | 2014-07-09 | 0 |
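A quick way to see how the two groups are balanced is to look at the class shares. The sketch below applies the same `Rating >= 4` rule to a hypothetical toy DataFrame (`toy` and its values are made up for illustration and are not taken from the dataset) and reports the share of each class.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for the `Rating` column.
toy = pd.DataFrame({"Rating": [5.0, 4.0, 3.0, 1.0, 4.0]})

# Same rule as above: 1 if Rating >= 4, else 0.
toy["Indicator"] = np.where(toy["Rating"] >= 4.0, 1, 0)

# Share of each class. With an imbalanced split, accuracy alone can be
# misleading, so precision and recall are worth reporting as well.
class_shares = toy["Indicator"].value_counts(normalize=True)
print(class_shares)
```

Running `ratings['Indicator'].value_counts(normalize=True)` would give the corresponding shares for the full sample.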


## Question 2

Next, we apply the usual preprocessing steps to the `Review` and `Summary` columns. We first define our own text-preprocessing functions and then collect the documents into a corpus.

In [10]:
# Use a separate name so the imported `stopwords` module is not overwritten
# (overwriting it breaks the cell when it is run a second time).
stop_words = set(stopwords.words("english"))
stemming = PorterStemmer()

# Before tokenization it is useful to make everything lowercase
# and to remove words of length one.
def remove_one_letter_words(text):
    return re.sub(r'\b[a-zA-Z]\b', '', text)

def clean_tokens(row, column_name):
    document = row[column_name] # the text of one document (i.e. review) in this row
    tokens = word_tokenize(document) # uses the word_tokenize function to tokenize the document
    alpha_tokens = [w for w in tokens if w.isalpha()] # removes all non-alphabetic tokens
    return alpha_tokens

def remove_stopwords(row, column_name):
    tokens = row[column_name] # the list of tokens for this row
    useful_words = [w for w in tokens if w not in stop_words] # takes only the words that are not in the stop word set
    return useful_words

def create_stems(row, column_name):
    words = row[column_name] # the stop-word-filtered tokens for this row
    stemmed_list = [stemming.stem(word) for word in words] # stems each word
    return stemmed_list

# Creating clean data and turning our preprocessed data into a single string for each document.
def rejoin_words(row):
    my_list = row['stems']
    joined_words = " ".join(my_list)
    return joined_words

def preprocess(raw_data):
    # Expects a DataFrame whose text is stored in a column named 'document'.
    raw_data = raw_data.copy()
    raw_data['document'] = raw_data['document'].astype(str).str.lower() # making everything lowercase
    raw_data['document'] = raw_data['document'].apply(remove_one_letter_words) # remove one letter words
    raw_data['tokens'] = raw_data.apply(lambda x : clean_tokens(x, 'document'), axis=1)
    raw_data['stopwords'] = raw_data.apply(lambda x : remove_stopwords(x, 'tokens'), axis=1)
    raw_data['stems'] = raw_data.apply(lambda x : create_stems(x, 'stopwords'), axis=1)
    raw_data['processed'] = raw_data.apply(rejoin_words, axis=1)

    # Clean up and remove columns that aren't needed in the long run.
    raw_data = raw_data.drop(['tokens', 'stopwords', 'stems', 'document'], axis = 1)

    return raw_data

# paragraph_data = preprocess(paragraph_data)
# print(paragraph_data)
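To make the effect of these steps concrete, the following sketch runs a condensed version of the pipeline on a single made-up sentence. It is an illustration under stated assumptions, not the notebook's exact code: a small hand-written stop word set (`TOY_STOP_WORDS`) and a regex tokenizer stand in for NLTK's English stop word list and `word_tokenize`, so that no NLTK data download is required. The helper name `preprocess_text` is hypothetical.

```python
import re
from nltk.stem import PorterStemmer

# Assumption: a tiny stop word set replaces NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "of", "and", "this", "to"}
stemmer = PorterStemmer()

def preprocess_text(text):
    """Lowercase, drop one-letter words, tokenize, filter stop words, stem, rejoin."""
    text = text.lower()
    text = re.sub(r"\b[a-z]\b", "", text)      # remove one-letter words
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    tokens = [w for w in tokens if w not in TOY_STOP_WORDS]
    stems = [stemmer.stem(w) for w in tokens]  # Porter stemming
    return " ".join(stems)

processed = preprocess_text("This is a review of the BEST songs!")
print(processed)
```

With the functions defined in the cell above, the reviews could be processed along the lines of `preprocess(ratings[['Review']].rename(columns={'Review': 'document'}))`, since `preprocess` expects the text in a column named `document`.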

## Question 3

## Question 4

## Question 5