# Decrypting cryptid

## Tools, technologies, & techniques featured in this notebook
- List TBD

In [91]:
import numpy as np
import pandas as pd
from numpy.linalg import svd
# import string

import matplotlib.pyplot as plt
%matplotlib inline

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
# from nltk.stem.porter import PorterStemmer
# from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [92]:
from urllib.request import urlopen
import plotly.express as px

## Text ingestion
**Source selection**
- Observation data was imported from the file you provided. We looked for other sources, but found that the publishers of this information took the most care to be most credible. A team of researchers would follow up on each sighting with an interview and collection of 'evidence' and attempt to consistently classify the report. They only publish the top three tiers of credibility--A through C in order of most to least evidence.

**Data import and data wrangling**

- 'Beautiful Soup' module to get the html data into a usable format
- Data was pretty messy--think of looking through a filing cabinet for a document where the person who was in charge of filing didn't reliably put files in the right folders
- Straightened out the filing cabinet and then obtained year and month, state and county, 


## Text preprocessing functions and methods
- Machine learning models need to have text converted into a format that they can use. 
- The steps we took to turn the text into machine-readable 'data' included stripping punctuation, tokenizing, lemmatization, and removing stopwords.
- We then 'vectorized' each observation so that the models could compare them.


In [2]:
wordnet = WordNetLemmatizer()
# porter = PorterStemmer()
# snowball = SnowballStemmer('english')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/salvir1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:

def remove_punc(string:str) -> str:
    '''Given a string, removes all punctuation and returned punctuation-less string'''
    return re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", string)

In [4]:
def tokenize(str):
    '''
    Tokenize a str and return a tokenized list.
    '''
    return [word for word in word_tokenize(str)]

In [5]:
def lemmatize(doc):
    '''Takes in a doc and lemmatizes tokens in doc
    Parameters
    ----------
    doc: list of tokens
    
    Returns
    -------
    lemmatized tokens
    '''
    return [wordnet.lemmatize(tkn) for tkn in doc]

In [6]:
def rm_stop_words(doc, stops=set(stopwords.words('english'))):
    '''Takes in a doc and removes stop words
    Parameters
    ----------
    doc: list of tokens
    
    Returns
    -------
    Tokens with stop words removed
    '''
    return([w for w in doc if w not in stops])

In [12]:
def preprocess_corpus(content):
    '''
    Add docstring. Make flexible to allow for doing, or not doing, preprocessing functions. 
    Parameters
    ----------
    content (str): a collection of strings
    Returns
    -------
    A list of lists: each list contains a tokenized version of the original string
    '''
    preprocessed = []
    for i in range(len(content)):
        step_1 = remove_punc(content[i].lower())
        step_2 = tokenize(step_1)
        step_3 = lemmatize(step_2)
        step_4 = rm_stop_words(step_3)
        preprocessed.append(step_4)
    return preprocessed

In [107]:
# loading bigfoot data
sightings_df = pd.read_csv('data/bigfoot_with_county.csv', index_col=0)

In [117]:
sightings_df['observations'] = sightings_df['observations'].astype(str)

### Preprocessing--data load and function calls

In [118]:
cleaned_tokenized = preprocess_corpus(sightings_df['observations']) # cleaned and tokenized

str_cleaned_tokenized = [" ".join(x) for x in cleaned_tokenized] # string version of cleaned and tokenized 

In [119]:
len(cleaned_tokenized)

4411

## Processing

In [120]:
# 'Bag of words function'
vect = CountVectorizer(max_features=500)
word_counts = vect.fit_transform(str_cleaned_tokenized)

In [121]:
len(vect.get_feature_names())

500

In [122]:
tfidfvect = TfidfVectorizer(max_features=500)
tfidf_vectorized = tfidfvect.fit_transform(str_cleaned_tokenized)
tfidf_vectorized.toarray()

array([[0.12824608, 0.0759244 , 0.07095224, ..., 0.        , 0.0518886 ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.03951046, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.06966028,
        0.        ],
       ...,
       [0.        , 0.        , 0.04487746, ..., 0.02769821, 0.        ,
        0.03032734],
       [0.        , 0.        , 0.        , ..., 0.20376512, 0.03449167,
        0.03187238],
       [0.        , 0.        , 0.        , ..., 0.15761544, 0.        ,
        0.        ]])

## Clustering with K Means

In [123]:
clusters = 5
kmeans = KMeans(n_clusters=clusters, 
                random_state=0).fit(tfidf_vectorized)

- Investigate the clusters  

> - Investigate the 'centroids' to find out what "topics" Kmeans has discovered by mapping these vectors back into the 'word space'.  Think of each feature/dimension of the centroid vector as representing the "average" article or the average occurrences of words for that cluster.
   
> - Find the features/dimensions with the greatest representation in the centroid.  Print out the top ten words for each centroid.


In [124]:
def Sort(sub_li): 
    return sorted(sub_li, key = lambda x: x[0], reverse=True)

def get_word(centroid):
    return [x[1] for x in centroid]

for k in range(5):
    matched = zip(kmeans.cluster_centers_[k], tfidfvect.get_feature_names())
    match = Sort(list(matched))
    print(' '.join(get_word(match[:24])), '\n')

track print wa inch snow footprint foot found toe area trail picture one long creek large road size human went like would could made 

wa creature saw tree large back foot area tall river seen looked dog house said wood one like heard ran around walking see two 

wa heard sound like scream night loud sounded noise time tent wood area howl dog back camp could one animal hear around went never 

wa road saw car driving side creature foot tall see looked hair back like dark around front seen large right highway home area could 

wa back saw like tree see could foot time wood would one around looked something area heard got friend thing went didnt never house 



For heirarchical clustering methods, see 819 am clustering assignment

## Cosine similarity
- Unsupervised learning

- Use the cosine similarity to compare similarity between documents.

- sklearn's [linear_kernel](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html) (computes dot product) can be used on tfidf to compute the cosine similarity since rows are normalized.*

- Here's a page on cosine similarity from [sklearn documentation](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and a relevant [stack overflow post](http://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity).

- *The stack overflow post is helpful. It provides instruction over how to slice the tfidf and then how to apply cosine similarity between one doc and all of the rest.*

In [128]:
cosine_similarities = linear_kernel(tfidf_vectorized[1:2], tfidf_vectorized[1:500]).flatten() # This is comparing cs for article #2 and the next 500.

In [126]:
related_docs_indices = cosine_similarities.argsort()[:-6:-1] # This identifies the index of the top 5 most similar.
print(related_docs_indices)

most_similar = cosine_similarities[related_docs_indices] # and their related cs
most_similar

[  0 134 299  51  35]


array([1.        , 0.41648761, 0.40048277, 0.39099911, 0.38874647])

In [127]:
for i in related_docs_indices:
    print(sightings_df['observations'].iloc[i]) # Going step by step pulling up the most similar reports by index

I and two of my friends were bored one night so we decided to do a little snowmachining. Though it was illegal to snowmachine in Anchorage, there were some good trails to ride on a little north of my house.  We took off at probably 11 pm, rode up the road about a quarter mile, and cut off on the trails. It had snowed about 10 inches a few days before so there was fresh snow, with no tracks.  I was leading the way for about a half hour, then we stopped and talked for a little bit.  We took off again and kept cruising on some sort of game trail that led to an opening in the woods.  I rode off into the opening with my friends following about fifty yards behind me.  I came over this little mound and saw strange tracks leading to this spot in the snow where it looked like something had pushed aside some snow and layed down.  I figured it was just a moose or something.  But I followed the tracks over the next small hill and as I came down the far side my headlight pointed right on the back o

## Decompositions NMF (and SVD)
- Unsupervised learning
- Good for situations when there's some potentially valid grouping to both rows and columns, such as putting Joe and Sam in the same group because they like similar movies (as opposed to traditional supervised models where there are features and targets)
- See 820pm solution to NMF for good soft classification and test of classification


## Naive Bayes
- Supervised learning method to assign class probabilities to a document
- See 818PM NLP-pipeline-programming-net-example for using sklearn Naive Bayes classifier. See also 818PM lecture on text classification. Solutions to assignment contain a number of useful naive Bayes python functions

In [102]:
counties = pd.read_csv('data/US_FIPS_Codes.csv', header=1)
counties