# NLP Final Project - Thomas Guardi

__Illinois is famous for being one of the very few states in the country with negative population growth.  The objective of your final project is to:__

1. Identify the key reasons for the declining population by extracting meaningful insights from unstructured text

2. Provide actionable recommendations on what can be done to reverse this trend

## **Main Goals**

- Clean up the noise (eliminate articles irrelevant to the analysis)
- Detect major topics
  - Identify top reasons for population decline (negative sentiment)
  - Suggest corrective actions
- Demonstrate how the city / state can attract new businesses (positive sentiment)
  - Leverage appropriate NLP techniques to identify organizations and people and apply targeted sentiment
- Why should businesses stay in IL or move into IL?
  - Create an appropriate visualization to summarize your recommendations (e.g. word cloud chart or bubble chart)
- Why should residents stay in IL or move into IL?
  - Create an appropriate visualization to summarize your recommendations (e.g. word cloud chart or bubble chart)


### **Additional Guidance**

Default sentiment from any software package will likely be wrong and will require significant tweaking, using either:

- A keyword / dictionary approach, or
- Labeling and classification
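
As a minimal sketch of the keyword / dictionary approach, the snippet below scores a text by counting hits against two small word lists. The lists `NEGATIVE_TERMS` and `POSITIVE_TERMS` and the function `keyword_sentiment` are hypothetical names made up for illustration; a real lexicon would need to be tuned to the news corpus.

```python
import re

# Hypothetical, hand-picked lexicons (assumptions, not part of the assignment).
NEGATIVE_TERMS = {"tax", "taxes", "crime", "debt", "corruption", "exodus"}
POSITIVE_TERMS = {"jobs", "growth", "investment", "hiring", "expansion"}


def keyword_sentiment(text):
    """Return (positive hits - negative hits) / token count, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE_TERMS for token in tokens)
    neg = sum(token in NEGATIVE_TERMS for token in tokens)
    return (pos - neg) / len(tokens)


if __name__ == "__main__":
    sample = "High taxes and crime are driving residents out of Illinois."
    print(keyword_sentiment(sample))
```

A score below zero suggests a text dominated by negative terms. Normalizing by token count keeps long articles from outweighing short ones.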

You are encouraged to explore a combination of several techniques to identify key topics:

- Topic modeling (e.g. LSA, LDA and TF-IDF)
- Classification (hand-label several topics on a sample and then train a classifier)
- Clustering (cluster topics around pre-selected keywords or word vectors)
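
To make the TF-IDF idea concrete, here is a plain-Python sketch that ranks the terms of each document by term frequency times inverse document frequency. The toy documents and the function name `top_tfidf_terms` are hypothetical, and the unsmoothed `log(N / df)` weighting is only one common variant. On the full corpus, a library such as scikit-learn or gensim would normally do this work.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, k=2):
    """For each tokenized document, return its k highest-scoring TF-IDF terms."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    toy_docs = [
        "illinois taxes rise again".split(),
        "illinois pension debt grows".split(),
        "illinois taxes and pension debt".split(),
    ]
    print(top_tfidf_terms(toy_docs))
```

Terms that appear in every document (here `illinois`) receive a weight of zero, which is why TF-IDF helps surface what is distinctive about each article.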

#### **PowerPoint Guidelines**
Please limit your work to 7 PowerPoint slides. 
On your slides you will want to provide:

- Executive Summary
- Methodology and source data overview
- Actionable recommendations

Please submit your actual program code (e.g. Python Notebook) along with your PowerPoint, as a separate attachment

Your presentation should be targeted toward a business audience and must not contain any code snippets

You are welcome to use any software packages of your choice to complete the assignment
 

In [1]:
from multiprocessing import cpu_count
print(cpu_count())

2


In [1]:
# pip install list
# !pip install pyLDAvis

In [2]:
# always these guys
import numpy as np
import pandas as pd
# for reading in the json
import json
# because you're gonna need to do regex
import re
import nltk
import matplotlib.pyplot as plt
import pickle
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

import warnings
warnings.simplefilter('ignore')

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
pwd

'/home/tguardi/Classes/NLP/Final_Project'

In [5]:
# Colab path
# path = /content/drive/MyDrive/
# local path
path = '/home/tguardi/Classes/NLP/Final_Project/news_chicago_il.json'
# rcc path
# directory = '/project2/msca/kadochnikov/news/'
# file = 'news_chicago_il.json'
# path = directory + file

In [5]:
articles = []
# Each line of the file is one JSON record (JSON Lines format).
with open(path, 'r') as f:
    for line in f:
        articles.append(json.loads(line))

In [7]:
# articles is a list of dicts, so the plain constructor is sufficient
df = pd.DataFrame(articles)
df.shape

(371788, 4)

In [8]:
df.head(5)

| | crawled_date | language | text | title |
|---|---|---|---|---|
| 0 | 1589155200000 | english | \nGov. Jay “Fatso” Pritzker called on all Illi... | All In Illinois |
| 1 | 1589155200000 | english | May 10, 2020 -The Illinois Department of Publi... | The Illinois Department of Public Health Annou... |
| 2 | 1589155200000 | english | Gloria Lawrence said: May 10, 2020 at 1:31 AM\... | Foto Friday: Alton, Illinois |
| 3 | 1589155200000 | english | NBA to follow German soccer league model with ... | Chris Broussard on Michael Jordan returning to... |
| 4 | 1589155200000 | english | Search Minggu, 10 Mei 2020 Pork chops vs. peop... | Pork chops vs. people: Can Americans’ appetite... |


# Text Cleaning

- All articles in English

In [9]:
%time df['text_clean'] = df['text'].map(lambda x: re.sub(r'\n', '.  ', str(x)))

CPU times: user 2.61 s, sys: 1.81 s, total: 4.42 s
Wall time: 4.42 s
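
The newline replacement above is the only cleaning step shown here. One hedged way to approach the "clean up the noise" goal is a keyword relevance filter, sketched below in plain Python. The term list in `RELEVANT_PATTERN`, the function `is_relevant`, and the `min_hits` threshold are all hypothetical choices for illustration, not the filter actually used in this project.

```python
import re

# Hypothetical relevance terms (assumptions chosen for illustration only).
RELEVANT_PATTERN = re.compile(
    r"\b(?:population|residents?|taxes|tax|pensions?|jobs?|housing|crime|business(?:es)?)\b",
    re.IGNORECASE,
)


def is_relevant(text, min_hits=2):
    """Return True when the text mentions at least min_hits relevance terms."""
    hits = sum(1 for _ in RELEVANT_PATTERN.finditer(text))
    return hits >= min_hits


if __name__ == "__main__":
    print(is_relevant("Residents cite high taxes and crime."))
    print(is_relevant("NBA to follow German soccer league model"))
```

On the DataFrame, a filter like this could be applied with `df[df['text_clean'].map(is_relevant)]`. A keyword filter is crude, so spot-checking the dropped articles would be advisable.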


# **LDA**

In [10]:
import gensim
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim  # note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

In [11]:
# with open('doc_clean.pkl', 'rb') as f:
#     doc_clean = pickle.load(f)

In [1]:
import joblib

In [2]:
dictionary = joblib.load('dict_doc.jl')

In [None]:
doc_term_matrix = joblib.load('doc_term_mx.jl')
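
The notebook loads `dict_doc.jl` and `doc_term_mx.jl` without showing how they were produced. Assuming they hold a gensim token dictionary and a bag-of-words corpus (the inputs `LdaMulticore` expects), the plain-Python sketch below illustrates that format: each document becomes a list of `(token_id, count)` pairs. The names `build_vocab` and `to_bow` are hypothetical. In gensim, `corpora.Dictionary` and its `doc2bow` method play these roles.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


if __name__ == "__main__":
    docs = [["tax", "tax", "jobs"], ["jobs", "crime"]]
    vocab = build_vocab(docs)
    print([to_bow(doc, vocab) for doc in docs])
```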

In [None]:
numtopics = 10
%time ldamodel = LdaMulticore(doc_term_matrix, num_topics=numtopics, id2word = dictionary, passes=50)
print(*ldamodel.print_topics(num_topics=5, num_words=3), sep='\n')