# Problem set 8: Mini-project

We've put some effort into building our collection (see problem set 7 for details and for links to texts and to metadata). Now it's time to learn something about it. You already have lots of excellent ideas for how to apply the tools we've learned about so far. It's also a good time of the semester to review what we have learned and practice applying it in less structured settings.

**You will work by yourself or in a group of up to three people** to complete a short project applying methods from the previous weeks to this collection. You will turn in the completed project as a single notebook (one submission per group) with the following sections:

1. **Question(s).** Describe what you wanted to learn. Suggest several possible answers or hypotheses, and describe in general terms what you might expect to see if each of these answers were true (save specific measurements for the next section). For example, many students want to know the difference between horror and non-horror, or between detective stories and horror fiction, but there are many ways to operationalize this question. You do not need to limit yourself to questions of genre. **Note that your question should be interesting! If the answer is obvious before you begin, or if it's something the importance of which you cannot explain, your grade will suffer (a lot).** (10 points)

1. **Methods.** Describe how you will use computational methods presented so far in this class to answer your question. What do the computational tools do, and how does their output relate to your question? Describe how you will process the collection into a form suitable for a model or algorithm and why you have processed it the way you have. (10 points)

1. **Code.** Carry out your experiments. Code should be correct (no errors) and focused (unneeded code from examples is removed). Use the notebook format effectively: code may be incorporated into multiple sections. (20 points)

1. **Results and discussion.** Use sorted lists, tables, and visual presentations to make your argument. Excellent projects will provide multiple views of results, and follow up on any apparent outliers or strange cases, including through careful reading of the original documents. (40 points)

1. **Reflection.** Describe your experience in this process. What was harder or easier than you expected? What compromises or negotiations did you have to accept to match the collection, the question, and the methods? What would you try next? (10 points)

1. **Responsibility and resources consulted.** Credit any online sources (Stack Overflow, blog posts, documentation) that you found helpful. (0 points, but -10 if missing)
    * **If you worked in a group**, set up a group submission in CMS. Each group member should submit (via CMS) a separate text file in which they describe each member's (including their own) contributions to the project.
    * Most people will turn in *either* a completed notebook for their solo project *or* a responsibility statement. The only people who will submit both files are those who are the designated submitter for their group. Don't worry if CMS warns you about a missing file (unless you're the group submitter).

Note that 10 points will be carried over from problem set 7.

**We will grade this work based on accuracy, thoroughness, creativity, reflectiveness, and quality of presentation.**

**Scope:** this is a *mini*-project, with a short deadline. We are expecting work that is consistent with that timeframe, but that is serious, thoughtful, and rigorous. This problem set will almost certainly require more time and effort than many of the others. **For group work, the expected scope grows linearly with the number of participants.**

# 0. Project team

List here the members of your project team, including yourself.

Eric Sun, Shalin Mehta

# 1. Question(s)

## Question 1

### Is there a significant difference in the way male and female writers create their books, and can we classify this?

Our first question is whether we can predict if a book was written by a male or a female author. We believe this is achievable because the papers we have read this semester suggest that there are measurable differences between male and female writers. For example, Piper's "Characterization" reports that female authors of older books wrote more introspective characters.

We therefore plan to use a bag-of-words model (TF-IDF) to predict whether a text was written by a male or a female author. We expect this to perform reasonably well, because some words may be used much more often by female authors than by male authors. This collection is also a good fit for the task because it is well balanced by gender. One factor that might affect the results is genre. For example, detective novels might read very similarly regardless of the author's gender, because the genre focuses on a narrow set of topics.

+ Hypothesis: we think that the author's gender can be predicted through the words they use in their text
+ Our suggested answers: we might expect to see that female authors tend to use more introspective words as suggested in Piper's paper. Or we might find that it's very difficult to differentiate the two, which might be a result of the genres this corpus is made of.

## Question 2

### Can we predict whether a novel was adapted or not based on multiple features?

We think we can predict fairly well whether a novel was adapted using features such as number of downloads, year of publication, word count, and point of view (pov). We believe these features indirectly measure how popular or well received a text was, and therefore whether it was adapted. We intend to use logistic regression or a random forest for the prediction.

+ Hypothesis: we believe that whether a novel was adapted can be predicted from features such as number of downloads and year of publication.
+ Our suggested answers: we think these features will predict adaptation fairly accurately. If they do not, one reason might be that Project Gutenberg's download count is not a good enough indicator of popularity.

# 2. Methods

## Question 1

For this question we intend to use sklearn's `TfidfVectorizer`, which is a normalized bag-of-words model that weights each term count by its inverse document frequency. We chose TF-IDF over a plain bag-of-words counter because it normalizes for document length and down-weights words that appear in most documents, so more distinctive words carry more weight. We intend to remove stop words using sklearn's built-in English stop word list (`stop_words='english'`). We believe this is acceptable because we are not very familiar with our dataset and have no specific words we are certain we want to keep, so the built-in list is good enough to remove very common words that are not helpful.

We intend to try several models. The first is a random forest, because it is a strong method that is easy to use and will hopefully perform well. The second is logistic regression, which is also a powerful classifier. Both methods can report a predicted probability for each class: a random forest averages the class probabilities predicted by its trees, and logistic regression outputs a probability directly. We also intend to try a support vector machine (SVM), a classifier that often performs well on high-dimensional, sparse features such as TF-IDF vectors.
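
The sketch below illustrates, on synthetic data, how both model types expose class probabilities through `predict_proba`, and that the forest's probability is the mean of its trees' probabilities. The data, sizes, and variable names are assumptions made for the example; it is not the experiment itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data: 60 rows, 4 features; the label mostly follows the first feature
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)
logit = LogisticRegression().fit(X_toy, y_toy)

# One row per sample, one column per class; each row sums to 1
forest_proba = forest.predict_proba(X_toy[:3])
logit_proba = logit.predict_proba(X_toy[:3])
print(forest_proba)
print(logit_proba)

# The forest's output is the average of the individual trees' probabilities
tree_mean = np.mean([tree.predict_proba(X_toy[:3]) for tree in forest.estimators_], axis=0)
print(np.allclose(forest_proba, tree_mean))
```

These probabilities describe how confident the model is in a prediction, not whether the prediction is correct.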

## Question 2

For this question we intend to use features taken directly from the CSV metadata file: number of downloads, word count, year of publication, point of view, and genre (horror, detective, or neither).

We believe these features are helpful in predicting the outcome, and will measure their effectiveness through our models.

The models we intend to use are random forest and logistic regression. These are good choices because they indicate how much each feature contributed to the prediction (feature importances for the random forest, coefficients for logistic regression) and can also report a predicted probability for each answer.
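
As a rough sketch of how the two models report feature contributions, the example below fits both on synthetic data in which only one column drives the label. The column names are hypothetical stand-ins for standardized metadata features, and the data is randomly generated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["downloads", "wordcount", "year"]  # hypothetical, already standardized
X_syn = rng.normal(size=(200, 3))
# Synthetic rule: only the first column ("downloads") drives the label
y_syn = (X_syn[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)
logit = LogisticRegression().fit(X_syn, y_syn)

# Random forest: non-negative importances that sum to 1 (no direction)
forest_importance = dict(zip(feature_names, forest.feature_importances_))
# Logistic regression: signed coefficients (direction and strength on scaled features)
logit_coef = dict(zip(feature_names, logit.coef_[0]))

for name in feature_names:
    print(f"{name:10s} importance={forest_importance[name]:.3f} coef={logit_coef[name]:+.3f}")
```

Forest importances say how much a feature was used, while the sign of a logistic coefficient also says in which direction it pushes the prediction.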

# 3. Code

In [138]:
# Imports (all of them!)
import pandas as pd
import numpy as np
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD

import seaborn as sns
import matplotlib.pyplot as plt

In [56]:
# getting our csv metadata and files
METADATA_FILE = 'info3350_lit_corpus.csv'
DATA_FOLDER = 'corpus'

data = pd.read_csv(METADATA_FILE)
display(data.describe())

# build the file paths from the filenames listed in the metadata
filenames = data['filename']
filepaths = [os.path.join(DATA_FOLDER, f) for f in filenames]

Unnamed: 0,owner,check_1,check_2,filename,title,author_surname,author_givenname,year,genre,form,country,wordcount,language,pov,gender,horror,detective,adaptation,downloads,source_url
count,138,123,92,138,138,138,138,137,138,138,138,135,138,138,138,138,138,136,132,137
unique,69,53,50,138,137,81,84,83,84,6,11,134,3,2,2,2,2,2,129,136
top,as2778,db758,swt53,The_Great_God_Pan.txt,Plague Ship,Norton,Andre,1922,Science Fiction,novel,us,60020,en,third,female,False,False,False,262,http://www.gutenberg.org/ebooks/4047
freq,3,4,4,1,2,10,10,5,8,118,64,2,134,75,71,97,104,82,2,2


In [58]:
# check for files that are missing or that cannot be decoded as UTF-8
for f in filepaths:
    if not os.path.isfile(f):
        print(f)

for f in filepaths:
    try:
        # the with-block closes each file after reading
        with open(f, encoding='utf-8') as file:
            text = file.read()
    except UnicodeDecodeError:
        print(f)

In [43]:
# adapted from the professor's homework code: builds a color-coded table of results
def compare_scores(scores_dict):
    '''
    Takes a dictionary of cross_validate scores.
    Returns a color-coded Pandas dataframe that summarizes those scores.
    '''
    # average each metric over the folds, then shade the cells by value
    df = pd.DataFrame(scores_dict).T.applymap(np.mean).style.background_gradient(cmap='RdYlGn')
    return df

## Question 1 Code

In [44]:
# Question 1

# Vectorize the texts
vectorizer = TfidfVectorizer(
    input='filename',
    encoding='utf-8',
    binary=False,
    norm='l2',
    min_df=0.5,            # keep only words that appear in at least half of the documents
    stop_words='english',  # sklearn's built-in English stop word list
    use_idf=True
)

X = vectorizer.fit_transform(filepaths)
y = [0 if g == 'female' else 1 for g in list(data['gender'])]

In [45]:
# Model training
models = {
    'random forest':RandomForestClassifier(n_estimators=300),
    'svm-linear':svm.SVC(kernel='linear'),
    'svm-rbf':svm.SVC(kernel='rbf'),
    'svm-linear-c=0.8':svm.SVC(kernel='linear', C=0.8),
    'svm-rbf-c=0.8':svm.SVC(kernel='rbf', C=0.8),
    'logistic':LogisticRegression()
}

scores = {}
for name, model in models.items():
    score = cross_validate(model, X, y, cv=5,
                           scoring=['accuracy', 'f1', 'f1_macro', 'f1_micro'], return_estimator=True)
    # keep the fitted estimator from the fold with the highest test accuracy
    models[name] = score['estimator'][np.argmax(score['test_accuracy'])]
    del score['estimator'] # delete it so compare_scores still works
    scores[name] = score
compare_scores(scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_f1,test_f1_macro,test_f1_micro
random forest,1.972171,0.142947,0.730952,0.724658,0.725425,0.730952
svm-linear,0.259116,0.060757,0.746296,0.76202,0.740326,0.746296
svm-rbf,0.283741,0.100956,0.731746,0.741354,0.729975,0.731746
svm-linear-c=0.8,0.408696,0.116782,0.724603,0.73207,0.720017,0.724603
svm-rbf-c=0.8,0.261718,0.064137,0.746032,0.759926,0.743448,0.746032
logistic,0.040122,0.004156,0.753968,0.741897,0.752117,0.753968
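
To put the accuracies above in context, a majority-class baseline can be sketched as follows. The label counts (71 female, 67 male) are taken from the `describe()` output earlier in the notebook; the placeholder feature matrix and the variable names are assumptions for illustration, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Labels rebuilt from the describe() counts: 71 female (0) and 67 male (1) authors
y_demo = np.array([0] * 71 + [1] * 67)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the dummy model ignores them

# Always predicts the most frequent class seen in the training folds
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"majority-class accuracy: {baseline_scores.mean():.3f}")
```

Because the classes are nearly balanced, this baseline lands close to 71/138, so any model accuracy should be read against roughly that floor.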


In [140]:
# Feature importance: find which words were most important in predicting outcome
importance = models['random forest'].feature_importances_
indices = np.argsort(importance)[::-1]
words = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
topWords = []
for i,ind in enumerate(indices):
    print(words[ind], importance[ind])
    topWords.append(words[ind])
    if i > 10: # stop after the top 12 words (i runs from 0 to 11)
        break

quantity 0.006698449897931407
eager 0.005767653500411427
impatiently 0.004901586915914168
questioned 0.004442634777894013
longed 0.004335125668674868
horrible 0.0041779177705078985
smoking 0.004085729332535447
presently 0.004061796292280788
devil 0.003771350012669546
fright 0.003731517017168895
thirty 0.0036316524084214485
nest 0.003604400839163455


In [156]:
# for each top word, compute the share of its total occurrences that falls in male- vs. female-authored books

# count the # of times each of our top words appears in each book
counter = CountVectorizer(
    input='filename',
    encoding='utf-8',
    binary=False,
    vocabulary=topWords
)

# count the number of times that word appears for male books or female books
count_vector = counter.fit_transform(filepaths)
wordToBook = {word:{'male':0,'female':0} for word in topWords}
for i,file in enumerate(filepaths):
    for j,word in enumerate(topWords):
        if count_vector[i,j] > 0:
            if y[i]: # male
                wordToBook[word]['male'] += count_vector[i,j]
            else:
                wordToBook[word]['female'] += count_vector[i,j]

# print each word's share of occurrences by author gender (as a fraction of 1)
for word,value in wordToBook.items():
    total = sum(value.values())
    male = value['male'] / total
    female = value['female'] / total
    print("{}: {:.2f} male {:.2f} female".format(word, male, female))

quantity: 0.67 male 0.33 female
eager: 0.36 male 0.64 female
impatiently: 0.40 male 0.60 female
questioned: 0.43 male 0.57 female
longed: 0.21 male 0.79 female
horrible: 0.71 male 0.29 female
smoking: 0.71 male 0.29 female
presently: 0.72 male 0.28 female
devil: 0.66 male 0.34 female
fright: 0.65 male 0.35 female
thirty: 0.60 male 0.40 female
nest: 0.35 male 0.65 female
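
The shares above are based on raw occurrence counts, so a few long books can dominate a word's total. One alternative view, sketched here on invented toy sentences, is the fraction of books in each group that contain the word at least once. The texts, labels, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy "books" (invented for illustration)
toy_texts = [
    "she longed for home",   # female
    "he longed to leave",    # female
    "a quiet nest",          # female
    "a horrible devil",      # male
    "smoking a pipe",        # male
    "he longed for war",     # male
]
toy_labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = female, 1 = male
toy_vocab = ["longed", "horrible"]

# binary=True records presence (1) or absence (0) instead of a count
presence_counter = CountVectorizer(binary=True, vocabulary=toy_vocab)
present = presence_counter.fit_transform(toy_texts).toarray()

# Fraction of books in each group that contain each word
female_share = present[toy_labels == 0].mean(axis=0)
male_share = present[toy_labels == 1].mean(axis=0)
for word, f_share, m_share in zip(toy_vocab, female_share, male_share):
    print(f"{word}: in {m_share:.2f} of male books, {f_share:.2f} of female books")
```

Unlike the occurrence shares, these two fractions do not need to sum to 1, because each is computed within its own group of books.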


## Question 2 Code

In [144]:
# getting data and skipping values that are NaN or a range (ex. 1809-1849)

# we build a list for each feature we want. If a value is missing or invalid, we skip that entire row
years = []
wordcount = []
downloads = []
detective = []
horror = []
adaptation_y = []
for _, row in data.iterrows():
    try:
        # parse every numeric field first so that a bad row adds nothing to any list
        year = int(row['year'])
        wc = int(row['wordcount'].replace(',',''))
        dl = int(row['downloads'].replace(',',''))
    except (ValueError, AttributeError): # skip the row because a feature is invalid: NaN or not an int
        print("Skipping row")
        print(row['title'], row['year'], row['wordcount'], row['downloads'], row['adaptation'])
        continue
    years.append(year)
    wordcount.append(wc)
    downloads.append(dl)
    detective.append(1 if row['detective'] else 0)
    horror.append(1 if row['horror'] else 0)
    # note: NaN is truthy, so a kept row with a missing 'adaptation' value is labeled 1 (adapted)
    adaptation_y.append(1 if row['adaptation'] else 0)

Skipping row
And Then There Were None 1939 53,921 nan True
Skipping row
The Count of Monte Cristo  nan 464, 234 11093 True
Skipping row
The Great Gatsby 1925 48410 nan True
Skipping row
The Paradise Mystery 1920 nan 161 False
Skipping row
The Shadow Over Innsmouth 1936 nan nan True
Skipping row
The Sorcery Club 1912 nan 66 False
Skipping row
The Works of Edgar Allan Poe - Volume 4 by Edgar Allan Poe 1809-1849 85,975 719 False
Skipping row
The Works of Edgar Allan Poe - Volume 5 by Edgar Allan Poe 1809-1849 72179 924 True
Skipping row
The Age of Innocence 1920 101254 nan nan
Skipping row
Their Eyes Were Watching God 1937 75952 nan True
Skipping row
The Mist 1980 61568 nan True


In [145]:
# preprocessing data by scaling it
adaptation_feats = [years, wordcount, downloads, detective, horror]

# stack the feature lists as columns; StandardScaler standardizes each column independently
adaptation_X = StandardScaler().fit_transform(np.column_stack(adaptation_feats))
print(adaptation_X.shape)
print(len(adaptation_y))

(127, 5)
127
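
Scaling the whole matrix before cross-validation lets the held-out rows influence the scaler's mean and variance. A possible alternative, sketched below on synthetic data, wraps the scaler and the model in one pipeline so that the scaler is re-fit on the training part of every fold. The data and variable names are assumptions for the example, and the effect on a dataset of this size is likely small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix whose columns are on very different scales
X_raw = rng.normal(size=(120, 3)) * np.array([1000.0, 50000.0, 30.0])
y_raw = (X_raw[:, 0] > 0).astype(int)

# cross_validate clones and fits the whole pipeline per fold,
# so scaling statistics come from the training rows only
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(scaled_logit, X_raw, y_raw, cv=5, scoring=["accuracy", "f1"])
print(cv_result["test_accuracy"].mean())
```

The same pattern works with any of the classifiers in the model dictionary in place of `LogisticRegression()`.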


In [146]:
# modelling
adaptation_models = {
    'random forest':RandomForestClassifier(n_estimators=300),
    'svm-linear':svm.SVC(kernel='linear'),
    'svm-rbf':svm.SVC(kernel='rbf'),
    'svm-linear-c=0.8':svm.SVC(kernel='linear', C=0.8),
    'svm-rbf-c=0.8':svm.SVC(kernel='rbf', C=0.8),
    'logistic':LogisticRegression()
}

adaptation_scores = {}
for name, model in adaptation_models.items():
    score = cross_validate(model, adaptation_X, adaptation_y, cv=5,
                           scoring=['accuracy', 'f1', 'f1_macro', 'f1_micro'], return_estimator=True)
    # keep the fitted estimator from the fold with the highest test accuracy
    adaptation_models[name] = score['estimator'][np.argmax(score['test_accuracy'])]
    del score['estimator'] # delete it so compare_scores still works
    adaptation_scores[name] = score
compare_scores(adaptation_scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_f1,test_f1_macro,test_f1_micro
random forest,0.874644,0.04943,0.699385,0.537544,0.655631,0.699385
svm-linear,0.002092,0.005353,0.66,0.389872,0.576258,0.66
svm-rbf,0.002011,0.007034,0.644,0.37511,0.562883,0.644
svm-linear-c=0.8,0.001652,0.004731,0.66,0.389872,0.576258,0.66
svm-rbf-c=0.8,0.002304,0.004957,0.636308,0.354158,0.550273,0.636308
logistic,0.014166,0.012821,0.659385,0.439137,0.594725,0.659385


In [147]:
# Feature importance
importance = adaptation_models['random forest'].feature_importances_
indices = np.argsort(importance)[::-1]
adaptation_feats_names = ['years','wordcount','downloads','detective','horror']
for i,ind in enumerate(indices):
    print(adaptation_feats_names[ind], importance[ind])

downloads 0.47909362785619897
wordcount 0.2669241981305822
years 0.2099144889887458
detective 0.022595388454808057
horror 0.021472296569665147


# 4. Results and discussion

# 5. Reflection

# 6. Responsibility and resources consulted