***


# From fringe revelry to growth industry

**Script used to analyze newspaper articles from the Guardian's 'dance corpus' **
<br/ >
**By Rens Wilderom, PhD candidate department of Sociology, University of Amsterdam**
<br/ >
**January, 2017**

***

***

# Part I:
### Exploring the initial corpus

***

In [1]:
# Import all necessary packages and such

from __future__ import print_function
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import os
import os.path
import pandas as pd
import glob
from glob import glob


import warnings
warnings.filterwarnings('ignore') # only use this when you know the script and want to supress unnecessary warnings

# specify the main corpus paths. These will be used throughout the script
CORPUS_PATH = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles dance query/Guardian dance query/"
MACHINE_LEARNING = "C:/Users/renswilderom/Documents/Machine learning"

In [2]:
# Import dataset consisting of seperate txt file
import os, os.path, glob
os.chdir(CORPUS_PATH)
files = glob.glob("*.txt")

articles_original=[]
print("Constructing dataset, total number of documents included:")
for file in files: 
    with open(file, errors="ignore") as fi:
        articles_original.append(fi.read())
length=len(articles_original)
print(length)

Constructing dataset, total number of documents included:
10197


In [3]:
# https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
english_stopwords = [
"length", "words", "reviewed", "www", "section", "byline", "author",    
"page", "features", "caption", "dateline", "said", "say", "says", "just", 
"year", "years", "day", "guardian", "new york times", "nytimes", "nyt", 
"parentheses", "python", "http", "url", "com", "correction", "appended", 
"type", "article", "date", "corrections", "mr", "mrs", "ms", "miss", "sir", 
"snp", "ind", "bnp", "rev","freeman", "hhh", "hhhh", "hhhhh", "pizazz", 
"org", "xfm", "cmp", "stx", "indl", "xxx", "dir", "est", "don", "est", 
"tel", "nnm", "mos", "tha", "ama", "der", "das", "bez", "les", "des", 
"pas", "thu", "mon", "mel", "sur", "moi", "rai", "che", "dab", "gus", 
"taj", "nyse", "dab", "tope", "taj", "smg", "ant", "january", "february", 
"march","april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "may", "june", "july", 
"aug", "sept", "oct", "nov", "dec", "monday", "tuesday", "wednesday", 
"thursday", "friday", "saturday", "sunday", "mondays", "tuesdays", 
"wednesdays", "thursdays", "fridays", "saturdays", "sundays",     
"a", "about", "above", "across", "after", "afterwards", "again", "against",   
"all", "almost", "alone", "along", "already", "also", "although", "always",
"am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
"any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
"around", "as", "at", "back", "be", "became", "because", "become",
"becomes", "becoming", "been", "before", "beforehand", "behind", "being",
"below", "beside", "besides", "between", "beyond", "bill", "both",
"bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
"could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
"down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
"elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
"everything", "everywhere", "except", "few", "fifteen", "fifty", "fill",
"find", "fire", "first", "five", "for", "former", "formerly", "forty",
"found", "four", "from", "front", "full", "further", "get", "give", "go",
"had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
"hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
"how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
"interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
"latterly", "least", "less", "ltd", "made", "many", "may", "me",
"meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
"move", "much", "must", "my", "myself", "name", "namely", "neither",
"never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
"nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
"once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
"ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
"please", "put", "rather", "re", "same", "see", "seem", "seemed",
"seeming", "seems", "serious", "several", "she", "should", "show", "side",
"since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhere", "still", "such",
"system", "take", "ten", "than", "that", "the", "their", "them",
"themselves", "then", "thence", "there", "thereafter", "thereby",
"therefore", "therein", "thereupon", "these", "they", "thick", "thin",
"third", "this", "those", "though", "three", "through", "throughout",
"thru", "thus", "to", "together", "too", "top", "toward", "towards",
"twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
"very", "via", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
"who", "whoever", "whole", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself",
"yourselves"]

In [4]:
# original vectorizer
tf_vectorizer_original = CountVectorizer(lowercase = True,
                                         strip_accents = 'unicode',
                                         stop_words = english_stopwords,
                                         token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                         max_df = 0.5, # ignore words occuring in >50% of the corpus (i.e. corpus specific stop words)
                                         min_df = 10) # ignore words in <10 documents of the corpus
dtm_tf_original = tf_vectorizer_original.fit_transform(articles_original) 
print(dtm_tf_original.shape)

# https://mimno.infosci.cornell.edu/papers/schofield_tacl_2016.pdf
# no stemming and no lematization

(10197, 27557)


In [5]:
# for TF DTM
lda_tf_original = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf_original.fit(dtm_tf_original)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1, n_topics=20,
             perp_tol=0.1, random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [6]:
# Conventional topics ORIGINAL

n_top_words = 30

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

tf_feature_names = tf_vectorizer_original.get_feature_names() 
print_top_words(lda_tf_original, tf_feature_names, n_top_words)

# create a doctopic matrix

filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

dtm_transformed = tf_vectorizer_original.fit_transform(articles_original)

doctopic = lda_tf_original.fit_transform(dtm_transformed)

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

# Write doctopic to a csv file

os.chdir(MACHINE_LEARNING) 

filenamesclean = [fn.split('/')[-1] for fn in filenames]
i=0
with open('doctopic_original_guardian_dance.csv',mode='w') as fo:
    for rij in doctopic:
        fo.write('"'+filenamesclean[i]+'"')
        fo.write(',')
        for kolom in rij:
            fo.write(str(kolom))
            fo.write(',')
        fo.write('\n')
        i+=1
print("finsihed with creating doctopic matrix")

Topic #0:
carnival dst hill close brazilian alexis rio brazil sky notting parade sports shipping alex samba morrison caribbean world wild golf jaxx aerobics christina trinidad doherty sao outlook dancing floats surf
Topic #1:
city house pounds town street place hotel night bar local road park home food travel room island old water building beach open centre small built houses sea great restaurant best
Topic #2:
uni sony japanese japan london tokyo business wolverhampton management central bertelsmann yen electronics nokia warner met european westminster studies goldfrapp childs nana teesside universal german finance electronic bank mary keegan
Topic #3:
pounds sound cds eno recording flamenco disc spanish record tape emi gypsy recordings vinyl philips discs recorded spain audio electronic compact del madrid price player cassette midi polygram sounds tapes
Topic #4:
record black pop british people hip records artists culture hop london success early rap rock young york way britain popul

***

# Part II:
### Analizing a subcorpus

***

In [7]:
# open doctopic.csv and create a new row with variable names

csv_file = pd.read_csv("C:/Users/renswilderom/Documents/Machine learning/doctopic_original_guardian_dance.csv", header=None, index_col=False,
                  names = ["file", "t_0", "t_1", "t_2", "t_3", "t_4", "t_5", "t_6", "t_7", "t_8", "t_9", 
                           "t_10", "t_11", "t_12", "t_13", "t_14", "t_15", "t_16", "t_17", "t_18", "t_19"])

# Load the xls file as a dataframe
df = csv_file
df

Unnamed: 0,file,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_10,t_11,t_12,t_13,t_14,t_15,t_16,t_17,t_18,t_19
0,1985-01-03_435_Guardian.txt,0.000275,0.000275,0.000275,0.000275,0.000275,0.000275,0.011690,0.301622,0.000275,...,0.000275,0.000275,0.000275,0.440185,0.000275,0.000275,0.157583,0.045676,0.000275,0.000275
1,1985-01-08_623_Guardian.txt,0.000175,0.051726,0.000175,0.000175,0.039976,0.196020,0.000175,0.093959,0.086182,...,0.000175,0.061606,0.016895,0.000175,0.000175,0.451531,0.000175,0.000175,0.000175,0.000175
2,1985-01-09_203_Guardian.txt,0.000481,0.000481,0.165677,0.133565,0.000481,0.316699,0.000481,0.000481,0.000481,...,0.241214,0.000481,0.000481,0.000481,0.135633,0.000481,0.000481,0.000481,0.000481,0.000481
3,1985-01-16_481_Guardian.txt,0.000249,0.000249,0.000249,0.012171,0.073473,0.053693,0.063484,0.591342,0.000249,...,0.064351,0.000249,0.000249,0.121890,0.000249,0.000249,0.000249,0.016610,0.000249,0.000249
4,1985-01-21_787_Guardian.txt,0.011814,0.000147,0.000147,0.000147,0.000147,0.638669,0.008506,0.000147,0.000147,...,0.000147,0.161232,0.000147,0.102486,0.000147,0.000147,0.039937,0.000147,0.006344,0.006002
5,1985-01-25_1125_Guardian.txt,0.000108,0.024564,0.000108,0.000108,0.027083,0.393150,0.000108,0.010348,0.000108,...,0.015020,0.267666,0.000108,0.235755,0.000108,0.004413,0.000108,0.018410,0.000108,0.002508
6,1985-01-29_350_Guardian.txt,0.000298,0.000298,0.000298,0.106494,0.463143,0.388062,0.000298,0.000298,0.000298,...,0.000298,0.000298,0.000298,0.000298,0.037539,0.000298,0.000298,0.000298,0.000298,0.000298
7,1985-01-31_58_Guardian.txt,0.001471,0.001471,0.001471,0.001471,0.001471,0.301166,0.001471,0.001471,0.001471,...,0.001471,0.475437,0.001471,0.001471,0.001471,0.079627,0.001471,0.001471,0.001471,0.001471
8,1985-02-01_99_Guardian.txt,0.000980,0.000980,0.000980,0.000980,0.253415,0.000980,0.000980,0.000980,0.000980,...,0.000980,0.193651,0.084067,0.000980,0.000980,0.316153,0.000980,0.138008,0.000980,0.000980
9,1985-02-13_220_Guardian.txt,0.000526,0.112817,0.000526,0.000526,0.147171,0.095851,0.000526,0.146968,0.072727,...,0.000526,0.000526,0.000526,0.072760,0.000526,0.000526,0.000526,0.081117,0.000526,0.000526


In [8]:
# calculate mean, std, cutoff high, and cutoff low

df_1 = df.describe().loc[['mean','std']]
df2 = df_1.transpose()
df2['cutoff_high'] = df2['mean'] + 2*df2['std'] 
df2['cutoff_low'] = df2['mean'] + df2['std'] 
df2

Unnamed: 0,mean,std,cutoff_high,cutoff_low
t_0,0.003499,0.012014,0.027527,0.015513
t_1,0.041334,0.080791,0.202915,0.122125
t_2,0.001673,0.011848,0.02537,0.013522
t_3,0.004716,0.017277,0.039269,0.021993
t_4,0.084188,0.10287,0.289929,0.187059
t_5,0.075631,0.139887,0.355405,0.215518
t_6,0.002617,0.008905,0.020428,0.011522
t_7,0.121854,0.18349,0.488834,0.305344
t_8,0.028849,0.055902,0.140653,0.084751
t_9,0.037224,0.096603,0.23043,0.133827


In [9]:
# Select the appropriate cutoff point per topic from the table above

# 18

t_18_cutoff_high = df2.get_value('t_18', 'cutoff_high')
t_18_cutoff_low = df2.get_value('t_18', 'cutoff_low')


In [10]:
# These values are used to create new 'dance_high' and 'dance_low' dummies in the original df 
# 18

df['dance_high'] = '0'
df['dance_high'][(df['t_18'] > t_18_cutoff_high)
                ] = '1' 

df['dance_low'] = '0'
df['dance_low'][(df['t_18'] > t_18_cutoff_low)
                ] = '1' 
df

Unnamed: 0,file,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_12,t_13,t_14,t_15,t_16,t_17,t_18,t_19,dance_high,dance_low
0,1985-01-03_435_Guardian.txt,0.000275,0.000275,0.000275,0.000275,0.000275,0.000275,0.011690,0.301622,0.000275,...,0.000275,0.440185,0.000275,0.000275,0.157583,0.045676,0.000275,0.000275,0,0
1,1985-01-08_623_Guardian.txt,0.000175,0.051726,0.000175,0.000175,0.039976,0.196020,0.000175,0.093959,0.086182,...,0.016895,0.000175,0.000175,0.451531,0.000175,0.000175,0.000175,0.000175,0,0
2,1985-01-09_203_Guardian.txt,0.000481,0.000481,0.165677,0.133565,0.000481,0.316699,0.000481,0.000481,0.000481,...,0.000481,0.000481,0.135633,0.000481,0.000481,0.000481,0.000481,0.000481,0,0
3,1985-01-16_481_Guardian.txt,0.000249,0.000249,0.000249,0.012171,0.073473,0.053693,0.063484,0.591342,0.000249,...,0.000249,0.121890,0.000249,0.000249,0.000249,0.016610,0.000249,0.000249,0,0
4,1985-01-21_787_Guardian.txt,0.011814,0.000147,0.000147,0.000147,0.000147,0.638669,0.008506,0.000147,0.000147,...,0.000147,0.102486,0.000147,0.000147,0.039937,0.000147,0.006344,0.006002,0,0
5,1985-01-25_1125_Guardian.txt,0.000108,0.024564,0.000108,0.000108,0.027083,0.393150,0.000108,0.010348,0.000108,...,0.000108,0.235755,0.000108,0.004413,0.000108,0.018410,0.000108,0.002508,0,0
6,1985-01-29_350_Guardian.txt,0.000298,0.000298,0.000298,0.106494,0.463143,0.388062,0.000298,0.000298,0.000298,...,0.000298,0.000298,0.037539,0.000298,0.000298,0.000298,0.000298,0.000298,0,0
7,1985-01-31_58_Guardian.txt,0.001471,0.001471,0.001471,0.001471,0.001471,0.301166,0.001471,0.001471,0.001471,...,0.001471,0.001471,0.001471,0.079627,0.001471,0.001471,0.001471,0.001471,0,0
8,1985-02-01_99_Guardian.txt,0.000980,0.000980,0.000980,0.000980,0.253415,0.000980,0.000980,0.000980,0.000980,...,0.084067,0.000980,0.000980,0.316153,0.000980,0.138008,0.000980,0.000980,0,0
9,1985-02-13_220_Guardian.txt,0.000526,0.112817,0.000526,0.000526,0.147171,0.095851,0.000526,0.146968,0.072727,...,0.000526,0.072760,0.000526,0.000526,0.000526,0.081117,0.000526,0.000526,0,0


In [11]:
# How many dance articles do I have according to the high criterion?

df3 = df[df.dance_high != '0']
df4 = df3[['file']]
df4.shape

(393, 1)

In [12]:
# How many dance articles do I have according to the low criterion?

df3 = df[df.dance_low != '0']
df5 = df3[['file']]
df5.shape

(797, 1)

In [13]:
# Create lists of file names beloning to the subcorpus 'high' and 'low'
# Probably this can be done in a more straightforward fashion... (but this works)

os.chdir(MACHINE_LEARNING) 

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('list_low_guardian_dance.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
df5.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

In [14]:
# Copy the subcorpus from its original folder to a new destination folder. 

import shutil
import os

# Create A folder for dance articles, if the folder does not exists.
if not os.path.exists("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles dance query/dance articles guardian low"):
    os.makedirs("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles dance query/dance articles guardian low")  
CORPUS_PATH_LOW = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles dance query/dance articles guardian low"

os.chdir(CORPUS_PATH)

# the following list of articles are dance articles:
files_tocopy = pd.read_excel("C:/Users/renswilderom/Documents/Machine learning/list_low_guardian_dance.xlsx") 
files_tocopy = files_tocopy['file'].apply(lambda x: x.replace('"', "")).tolist()


for f in files_tocopy:
    shutil.copy(f, CORPUS_PATH_LOW)   
     
        
print ("Done with copying files")   

Done with copying files


In [15]:
# Import dataset consisting of seperate txt file
import os, os.path, glob
os.chdir(CORPUS_PATH_LOW)
files = glob.glob("*.txt")

articles_low=[]
print("Constructing dataset, total number of documents included:")
for file in files: 
    with open(file, errors="ignore") as fi:
        articles_low.append(fi.read())
length=len(articles_low)
print(length)

Constructing dataset, total number of documents included:
797


In [16]:
# original vectorizer
tf_vectorizer_low = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf_low = tf_vectorizer_low.fit_transform(articles_low) 
print(dtm_tf_low.shape)

# What about stemming (says, say, said)

(797, 3736)


In [17]:
# LDA TF DTM
lda_tf_low = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf_low.fit(dtm_tf_low)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1, n_topics=20,
             perp_tol=0.1, random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [18]:
# Conventional topics LOW

n_top_words = 30

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

tf_feature_names = tf_vectorizer_low.get_feature_names() 
print_top_words(lda_tf_low, tf_feature_names, n_top_words)

# create a doctopic matrix LOW

filenames = sorted([os.path.join(CORPUS_PATH_LOW, fn) for fn in os.listdir(CORPUS_PATH_LOW)])

dtm_transformed = tf_vectorizer_low.fit_transform(articles_low)

doctopic = lda_tf_low.fit_transform(dtm_transformed)

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

# Write doctopic to a csv file

os.chdir(MACHINE_LEARNING) 

filenamesclean = [fn.split('/')[-1] for fn in filenames]
i=0
with open('doctopic_low.csv',mode='w') as fo:
    for rij in doctopic:
        fo.write('"'+filenamesclean[i]+'"')
        fo.write(',')
        for kolom in rij:
            fo.write(str(kolom))
            fo.write(',')
        fo.write('\n')
        i+=1
print("finsihed with creating doctopic matrix")

Topic #0:
dublin bomb colston hayter aged court trial told conspiracy lines agreed detective liverpool public switch denies terrorism manchester alleged william christmas retired pub deny guilty charge inspector continues patrick birmingham
Topic #1:
letters directly scenes reply option parties mrs councils views organisers asked acid pollution jail profits authorities minister fines office environment local powers private police size necessary wrote proposed sentences letter
Topic #2:
police party letters officers acid drug man drugs day cell prison time breach yesterday just young arrested crack peace london year night hospital staff like pounds mrs hand officer west
Topic #3:
sponsored final bright graham fines stages acid aged police passed jail dancer mrs court yesterday danced drugs parties pounds claim private shall brief tory spot arrested boots member somewhat party
Topic #4:
police party yesterday parties acid rave new court officers law pounds public government organisers tr

***

# Part III:
### Exploring the top articles per topic

***

In [19]:
# Open the CSV file produced in the cell above in order to explore the top articles related to the topic of interest

# IMPORTANT: choose the appripriate CSV _original, _high, or _low


import pandas as pd
csv_file = pd.read_csv("C:/Users/renswilderom/Documents/Machine learning/doctopic_original.csv", header=None, index_col=False,
                  names = ["file", "t_0", "t_1", "t_2", "t_3", "t_4", "t_5", "t_6", "t_7", "t_8", "t_9", "t_10", 
                           "t_11", "t_12", "t_13", "t_14", "t_15", "t_16", "t_17", "t_18", "t_19"])

# When creating a row with new names, be careful not to overwrite the original first row.
# Load the xls file a dataframe
df = csv_file

print(df.shape)

(5244, 21)


In [20]:
# What is the topic of interest?

topic_of_interest = "t_18"

# Set the directory, this is based on on the same location as the doctopic matrix

os.chdir(CORPUS_PATH)

In [21]:
# rank texts in decending order

df1 = df[['file', topic_of_interest]] 
df2 = df1.sort_values(topic_of_interest, ascending=False)
df3 = df2.head(50)
df3

Unnamed: 0,file,t_18
616,1990-09-15_1638_Guardian.txt,0.933113
625,1990-09-28_1514_Guardian.txt,0.919245
179,1987-12-12_1648_Guardian.txt,0.906123
592,1990-07-28_2111_Guardian.txt,0.894471
914,1992-11-13_635_Guardian.txt,0.874363
1769,1997-12-01_71_Guardian.txt,0.860342
3767,2002-11-11_1229_Guardian.txt,0.859985
1156,1994-09-17_8756_Guardian.txt,0.858149
3429,2001-11-03_5013_Guardian.txt,0.849932
1713,1997-09-27_662_Guardian.txt,0.837968


In [None]:
# print text in rank 1 (.iloc[0] prints the highest ranked text;.iloc[1] prints the second highest ranked text, and so on)

interest1 = df3['file'].iloc[0]
file  = open(interest1, 'r+') 
file.read().splitlines()

***

# End of script

***

***

### More links and resources:

#### This is the original script on which this notebook is based: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb

#### TF-IDF vectorizers: http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

#### MMDS is dimension reduction via Jensen-Shannon Divergence & Metric Multidimensional Scaling: http://bugra.github.io/work/notes/2014-03-16/jensen-shannon-divergence-matrix-multi-dimensional-scaling/

#### Working with Markdown: http://datascience.ibm.com/blog/markdown-for-jupyter-notebooks-cheatsheet/

#### Notebook shortcuts: https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf_bw/

#### Number of topics I: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

#### Number of topics II: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/

#### Pandas: https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

#### Fit transform: https://datascience.stackexchange.com/questions/12321/difference-between-fit-and-fit-transform-in-scikit-learn-models


***