<h1> Making It Count Homework

In [2]:
import pandas as pd
import altair as alt

<h3>First, make sure you have the Humanist listserv dataset and that you have dates associated with each volume

In [3]:
humanist_vols = pd.read_csv('web_scraped_humanist_listserv_volumes.csv')
humanist_vols

Unnamed: 0,volume_text,volume_link,volume_dates,volume_number,inferred_start_year,inferred_end_year
0,From: MCCARTY@UTOREPAS\nSubject: \nDate: 12 Ma...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1987-1988,1,1987,1988
1,From: Sebastian Rahtz \nSubject: C++ and Gnu o...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1988-1989,2,1988,1989
2,From: Willard McCarty \nSubject: Happy Birthda...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1989-1990,3,1989,1990
3,From: Elaine Brennan & Allen Renear \nSubject:...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1990-1991,4,1990,1991
4,From: Elaine Brennan & Allen Renear \nSubject:...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1991-1992,5,1991,1992
5,From: Elaine M Brennan \nSubject: Humanist's B...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1992-1993,6,1992,1993
6,From: 6500card%ucsbuxa@hub.ucsb.edu (Cheryl A....,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1993-1994,7,1993,1994
7,From: Andrew Burday \nSubject: Re: 7.0638 Qs: ...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1994-1995,8,1994,1995
8,"From: ""Gregory Bloomquist"" \nSubject: Round Ta...",https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1995-1996,9,1995,1996
9,From: Humanist \nSubject: Humanist begins its ...,https://humanist.kdl.kcl.ac.uk/Archives/Conver...,1996-1997,10,1996,1997
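The classifier further down reads a `period` column that does not appear in the table above. A minimal sketch of how such a column might be derived from `inferred_start_year` follows. The cutoff years and the toy frame `toy_vols` are assumptions made for illustration; they are not taken from the assignment, so adjust them to your own periodization.

```python
import pandas as pd

# Toy stand-in for humanist_vols; only the year column matters for this sketch.
toy_vols = pd.DataFrame({"inferred_start_year": [1987, 1996, 2004, 2009, 2015, 2022]})

# Hypothetical cutoffs (assumptions): each bin is open on the left, closed on the right.
period_bins = [1986, 2003, 2012, 2100]
period_labels = ["early_internet", "web_2.0", "contemporary"]

# pd.cut assigns each start year to the bin it falls into.
toy_vols["period"] = pd.cut(
    toy_vols["inferred_start_year"], bins=period_bins, labels=period_labels
).astype(str)

print(toy_vols)
```

The same `pd.cut` call could be applied to `humanist_vols['inferred_start_year']` once you have settled on the boundaries.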


<h3>Second, I would highly recommend using TF-IDF to help identify what is distinctive about each of these time periods.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [24]:
# Create a vectorizer that ignores terms appearing in more than 80% of the volumes
vectorizer = TfidfVectorizer(max_df=0.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(humanist_vols['volume_text'])

# Split the data into training and test sets.
# Note: this assumes humanist_vols already has a 'period' column labelling each volume.
X_train, X_test, y_train, y_test = train_test_split(
    transformed_documents, humanist_vols['period'], test_size=0.2, random_state=0
)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000, penalty='l2', solver='liblinear')
clf.fit(X_train, y_train)

# Predict the time period of the test set
y_pred = clf.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

                precision    recall  f1-score   support

  contemporary       1.00      1.00      1.00         3
early_internet       0.67      1.00      0.80         2
       web_2.0       1.00      0.50      0.67         2

      accuracy                           0.86         7
     macro avg       0.89      0.83      0.82         7
  weighted avg       0.90      0.86      0.85         7
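One thing to be aware of: the cell above fits the vectorizer on every volume before splitting, so the test volumes influence the vocabulary and IDF weights. With only seven test documents, the split can also leave a period thinly represented. A hedged sketch of one alternative follows, using a scikit-learn pipeline and a stratified split on invented toy texts. The texts and labels are assumptions for illustration only, and the pipeline uses the default solver rather than the `liblinear` setting above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus: four short "volumes" per period.
toy_texts = [
    "bitnet listserv mainframe mail",
    "gopher ftp mainframe mail",
    "bitnet telnet ftp archive",
    "gopher telnet listserv archive",
    "blog wiki podcast feed",
    "wiki tagging feed mashup",
    "blog podcast mashup tagging",
    "wiki blog feed tagging",
    "twitter github data notebook",
    "github notebook cloud data",
    "twitter cloud data model",
    "notebook model github cloud",
]
toy_labels = ["early_internet"] * 4 + ["web_2.0"] * 4 + ["contemporary"] * 4

# Split the raw texts first; stratify keeps every period in both sets.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_model.fit(train_texts, train_labels)

toy_predictions = toy_model.predict(test_texts)
print(list(zip(test_labels, toy_predictions)))
```

On the real data, the same pattern would take `humanist_vols['volume_text']` and `humanist_vols['period']` in place of the toy lists.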



In [25]:
# Get the coefficients for each term
coefficients = clf.coef_
# Get the terms
terms = vectorizer.get_feature_names_out()
# Create a dataframe of the terms and coefficients.
# Rows of clf.coef_ follow the order of clf.classes_, so take the column names from it
terms_df = pd.DataFrame(coefficients.T, columns=clf.classes_)
terms_df.insert(0, 'term', terms)
# Get the 100 highest-coefficient terms for each period
top_terms = (
    terms_df.melt(id_vars='term', var_name='period', value_name='coefficient')
    .sort_values(by='coefficient', ascending=False)
    .groupby('period')
    .head(100)
)
top_terms


Unnamed: 0,term,period,coefficient
426210,num,early_internet,1.424521
100504,digitalhumanities,contemporary,1.066091
179956,onlinehome,contemporary,0.628015
204639,s16382816,contemporary,0.628015
510436,2006,web_2.0,0.543027
...,...,...,...
135055,humanidades,contemporary,0.004578
125802,gmail,contemporary,0.004523
116600,fatal,contemporary,0.004514
92834,culingtec,contemporary,0.004481


<h3> Finally, you’ll need to visualize your results

In [26]:
# visualize top terms
top_terms['period'] = top_terms['period'].astype(str)
selection = alt.selection_point(fields=['term'], bind='legend')

# Define the sort order for the periods
period_order = ['early_internet', 'web_2.0', 'contemporary']

chart = alt.Chart(top_terms).mark_bar().encode(
    x=alt.X('period', sort=period_order, axis=alt.Axis(title='Period')),
    y=alt.Y('coefficient:Q'),  # Logistic regression coefficient for each term
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'coefficient', 'period'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 100 Terms per Period by Logistic Regression Coefficient'
)
chart