# NYCDSA Blog Post Data: Natural Language Processing

**Objectives:** 
1. Load and clean the dataset for a WordCloud.
2. Use Latent Dirichlet Allocation (LDA) to group the blog posts into topics. These need not match the posts' original categories; they could be themes such as politics, sports, or technology.
3. Further study: perform text classification to see whether a blog post from the "Student Works" category can be assigned to a more specific category such as R, Web Scraping, or Machine Learning.

- The NYCDSA blog post dataset used here, "lda.csv", was scraped from https://nycdatascience.com/blog/ on Oct 26, 2018. 

# Objective 2. Latent Dirichlet Allocation (LDA)

- LDA is an unsupervised method that explains a collection of documents through a set of unobserved (latent) topics. Each document is modeled as a mixture of topics and each topic as a distribution over words, so documents that share topics are similar.
 
### Steps:
- Extract term-frequency features from the corpus (NYCDSA Blog Data).
- Fit the LDA model and inspect the top words of each topic (a topic is a latent variable learned by LDA and used to characterize a document).
- Finally, inspect how the topics relate to particular documents.
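
The first step can be illustrated without any libraries. The sketch below is a simplified stand-in for what `CountVectorizer` does with `min_df` and `max_df`, run on a hypothetical three-document corpus (the documents, the `tokenize` helper, and the variable names are assumptions for illustration). Stop-word removal and `max_features` are not modeled here.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for the blog posts.
docs = [
    "shiny app maps voting data",
    "stock data and trading model",
    "shiny app for stock data",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokens = [tokenize(d) for d in docs]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in tokens for term in set(doc))

min_df = 2     # keep terms found in at least 2 documents
max_df = 0.95  # drop terms found in more than 95% of documents
n_docs = len(docs)
vocab = sorted(t for t, df in doc_freq.items()
               if df >= min_df and df <= max_df * n_docs)

# Document-term matrix: one row per document, one column per vocabulary term.
matrix = [[Counter(doc)[t] for t in vocab] for doc in tokens]

print(vocab)
for row in matrix:
    print(row)
```

Here "data" is dropped because it appears in every document, and words that occur in only one document are dropped by `min_df`.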

In [6]:
# the following was done in python 2
import pandas as pd

from time import time

from sklearn.feature_extraction.text import CountVectorizer #counting object
from sklearn.decomposition import LatentDirichletAllocation #train lda

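def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted words of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so the reversed slice takes the largest weights first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print("")  # blank line between topics (works in Python 2 and 3)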
        
# load the scraped posts; the 'post' column holds the text of each blog post
df = pd.read_csv('lda.csv')
dataset = df
data_samples = dataset.post

n_features = 1000
#n_topics, number of topics = 10  
#n_top_words, top words defining each topic = 20

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,  #specify rules
                                stop_words='english')

tf = tf_vectorizer.fit_transform(data_samples) 

type(tf)            # <class 'scipy.sparse.csr.csr_matrix'>: sparse matrix, stores only the non-zero entries
type(tf.toarray())  # <type 'numpy.ndarray'>: dense copy of the same matrix

numpy.ndarray

In [80]:
print(tf.toarray().shape)  # each row is a document, each column is a word (top 1000 words)
print('-' * 88)

print(tf.toarray()[958, 0:200])  # counts of words 0-199 in article 958, after tf_vectorizer.fit_transform()
print('-' * 88)

print(tf_vectorizer.get_feature_names()[50:75])     # words 50-74 found by the vectorizer
print(tf_vectorizer.get_feature_names()[975:1000])  # last 25 words; the vocabulary is capped by max_features

(1215L, 1000L)
----------------------------------------------------------------------------------------
[1 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 3 0 1 1
 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0
 0 0 0 3 0 0 0 0 0 0 1 1 0 0 4 0 0 0 0 0 2 5 5 0 2 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
----------------------------------------------------------------------------------------
[u'american', u'ams', u'analysis', u'analytics', u'analyze', u'analyzing', u'annual', u'answer', u'api', u'app', u'appear', u'appears', u'application', u'applications', u'applied', u'apply', u'approach', u'approximately', u'apps', u'april', u'area', u'areas', u'article', u'articles', u'artists']
[u'went', u'white', u'wide', u'win', u'wine', u'wines', u'winning', u'women', u'word', u'words', u'work

In [68]:
data_samples[958][0:615] #my R Shiny Project

'Taiwan Voting Data: Exploratory Visualization and Shiny Project - You will build an interactive Shiny application that can create visual representations...: You will build an interactive Shiny application that can create visual representations of data insights and trends.Visit my project here: : This was the first project I did at the YC Data Science Academy (Fall Cohort 2018). I browsed my way to  where they store a large variety of publicly shared data. I wanted to select a data set suitable for charting and mapping purposes. I also preferred to work with data which I had to translate from Mandarin Chinese'

In [9]:
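# Now we fit the LDA model.
n_topics = 10

# Note: n_topics is the older scikit-learn name for this argument; it was
# renamed n_components in version 0.19 and removed in 0.21.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

lda.fit(tf)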

#lda.components_ 

#lda.components_.shape



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=10, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [83]:
#lda.components_ 
lda.components_.shape

(10L, 1000L)
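
Each of the 10 rows of `lda.components_` holds one weight per vocabulary word. In scikit-learn these weights behave like pseudo-counts, and dividing each row by its sum turns it into a word distribution for that topic (with NumPy this would be `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`). A minimal pure-Python sketch of that normalization, using hypothetical weights rather than the fitted model:

```python
# Hypothetical pseudo-counts for 2 topics over a 3-word vocabulary
# (stand-in for lda.components_, which is 10 x 1000 in this notebook).
components = [
    [6.0, 3.0, 1.0],
    [1.0, 1.0, 8.0],
]

# Divide every row by its own sum so that each topic's weights add up to 1.
topic_word = [[value / sum(row) for value in row] for row in components]

for row in topic_word:
    print(row)
```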

In [10]:
n_top_words = 20
print("\nTopics in LDA model:\n") #print_top_words
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

# each post is a mixture of these topics; read a topic's top words to see what its posts might be about
# for comparison, without stop-word removal Topic #0 was: that we this on it for but are as movies user more be reviews with by there analysis rate can


Topics in LDA model:

Topic #0:
data science learning work job time use project academy people like yc using bootcamp python students machine really want just

Topic #1:
reviews rating ratings movie shows wine average movies music data review score number user artists complaints scores songs different higher

Topic #2:
data number time year years countries city ew population york world map people states state analysis country different shows dataset

Topic #3:
data app shiny user time information number map code based different tab health used project average job application chart salary

Topic #4:
price market sales prices fig airlines flight brands project category number brand average popular categories industry shot days apps projects

Topic #5:
stock games data performance analysis game correlation time model number used plot companies trading significant variables news stocks values words

Topic #6:
data crime school schools loan education rates car rate high house loans student
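
The word lists above come from the slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words`. A small sketch of how that slice picks the highest-weighted words, using a plain list and hypothetical weights in place of a NumPy row:

```python
# Hypothetical weights of one topic over a four-word vocabulary.
feature_names = ["app", "data", "shiny", "map"]
topic = [0.5, 3.0, 1.2, 2.1]
n_top_words = 2

# Indices ordered by ascending weight, the list analogue of topic.argsort().
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Same slice as in print_top_words: walk backwards from the last index
# and stop once n_top_words indices have been taken.
top_indices = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(" ".join(top_words))  # data map
```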

In [96]:
text = data_samples[958]

print(text[0:615])
print('-' * 88)
print(lda.transform(tf[958, :]))  # topic mixture of article 958: one weight per topic

Taiwan Voting Data: Exploratory Visualization and Shiny Project - You will build an interactive Shiny application that can create visual representations...: You will build an interactive Shiny application that can create visual representations of data insights and trends.Visit my project here: : This was the first project I did at the YC Data Science Academy (Fall Cohort 2018). I browsed my way to  where they store a large variety of publicly shared data. I wanted to select a data set suitable for charting and mapping purposes. I also preferred to work with data which I had to translate from Mandarin Chinese
----------------------------------------------------------------------------------------
[[1.36030417e-01 1.96982420e-02 7.70709540e-02 7.65025867e-01
  3.62418627e-04 3.62425841e-04 3.62396375e-04 3.62411829e-04
  3.62429431e-04 3.62438249e-04]]


### Top 3 Matches for my blog post "Taiwan Voting Data":
- Topic #3: 0.765 : data app shiny user time information number map code based different tab health used project average job application chart salary

- Topic #0: 0.136 : data science learning work job time use project academy people like yc using bootcamp python students machine really want just

- Topic #2: 0.077 : data number time year years countries city ew population york world map people states state analysis country different shows dataset
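
The ranking above was read off the `lda.transform` output by hand. A short sketch of doing the same programmatically, with the ten weights copied from that output (the variable names are illustrative):

```python
# Topic weights for post 958, copied from the lda.transform output above.
doc_topics = [1.36030417e-01, 1.96982420e-02, 7.70709540e-02, 7.65025867e-01,
              3.62418627e-04, 3.62425841e-04, 3.62396375e-04, 3.62411829e-04,
              3.62429431e-04, 3.62438249e-04]

# Pair each weight with its topic index and sort by weight, largest first.
ranked = sorted(enumerate(doc_topics), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]

for topic_idx, weight in top3:
    print("Topic #%d: %.3f" % (topic_idx, weight))
```

The first entry of `ranked` is the dominant topic, which is one simple way to assign each post to a single group.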

### Further study: Text Classification
- LDA topics contain words like "shiny" and "scraping". Might it be possible to use the topics to group unlabeled articles into categories?
- Alternatively, I would use text classification: the same idea as sentiment analysis, but with categories such as R, Python, Web Scraping, and R Shiny in place of sentiments. This is a supervised learning method that trains a model to recognize the different blog post categories. 
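
As a rough idea of what the supervised route could look like, here is a minimal multinomial Naive Bayes classifier in pure Python. It is only a sketch under stated assumptions: the training posts and their labels are invented for illustration (real labels would have to be assigned by hand), and Naive Bayes is one possible classifier, not the method chosen for this project.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled posts, invented for this example.
train = [
    ("shiny app with interactive map and tabs", "R Shiny"),
    ("interactive shiny dashboard with ggplot charts", "R Shiny"),
    ("scraping reviews with scrapy and selenium", "Web Scraping"),
    ("scrapy spider scraping wine ratings", "Web Scraping"),
]

def tokenize(text):
    return text.lower().split()

class_docs = Counter()              # number of training posts per category
word_counts = defaultdict(Counter)  # word counts per category
vocab = set()
for text, label in train:
    class_docs[label] += 1
    for word in tokenize(text):
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    """Return the category with the highest smoothed log-probability."""
    n_train = float(sum(class_docs.values()))
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_train)  # class prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[label][word] + 1.0)
                              / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("a shiny app with an interactive chart"))
print(predict("scraping movie ratings with scrapy"))
```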

In [50]:
# Topic distribution for every post: one row per post, one column per topic.
# transform accepts the whole document-term matrix, so no hard-coded row count is needed.
textclass = lda.transform(tf).tolist()

# Write one line per post containing its comma-separated topic weights.
with open('csvfile.csv', 'w') as f:
    f.writelines(["%s\n" % ",".join(repr(w) for w in row) for row in textclass])

# textclass[i][k] is the weight of topic k for post i, e.g. textclass[958][3]
