# Introduction

For this project, the client is the Chancellor's Office for state community colleges, which we'll call CCN (Community Colleges of Nowhere). Every year, the colleges of CCN submit annual reports to the Chancellor's Office (CO) detailing how they have spent the money allocated to them by the state, along with their objectives and strategies. Currently, these reports are read by someone at a consulting firm, who aggregates and summarizes the findings into an executive summary that is given to the CO. The goal is to use Natural Language Processing (NLP) to reproduce parts of that executive summary for the CO, such as main themes and trends, and to give them a high-level picture of the groups of colleges that are focused on similar themes.

To get the best idea of the main topics and the similarities between them, I tried three different methods. The notebook for each is linked below:

   </br>[TFIDF and KMeans Clustering](tfidf_kmeans.ipnyb)
   
   </br>[Doc2Vec with Gensim and KMeans Clustering](doc2vec_attemp.ipnyb)
   
   </br>[Topic Modeling with LDA](topic_modeling.ipynb)

## The Data: Annual Reports

Annual reports are submitted every year by each institution. They include four narrative fields: an overall description and three response narratives. There are 71 reports in total, with about 700 words each.

In addition to these documents, institutions that have a project in mind for the funding also submit proposals. They are similar to annual plans but are required for different funding types. There are many more of these proposals, around 5,000. Since they use similar verbiage, I was curious to see what kind of topics would come out of a larger data set.

## Preprocessing

<img src='NLP_preprocess.png'>

Most of the preprocessing was about the same across the three methods. Punctuation was removed from the documents to form tokens, which were converted to lowercase. The stop words were a combination of the English stop word list from NLTK and a custom list that I added to as I iterated over the process. I kept only nouns, verbs, and adjectives because those words held the most meaning in the data set I was using. Stemming was used to reduce all words to their stems, so that education, educated, and educating would all be treated as the same word.
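The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the `preprocess` function, the regex tokenizer, and the small `STOP_WORDS` set are assumptions made for the example (the set stands in for the NLTK English list plus the custom additions), and the part-of-speech filter is left out because it needs a downloaded tagger model.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical stand-in for the NLTK English stop words plus the custom list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "are"}

_stemmer = PorterStemmer()


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stop words, and stem the rest."""
    # Keep runs of letters and digits; everything else acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [_stemmer.stem(tok) for tok in tokens if tok not in stop_words]
```

Calling `preprocess("Education, educated, and educating")` should return the same stem three times, which is the behavior described above.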

## TFIDF and KMeans Clustering

Top terms per cluster:

_Cluster 0 words:_ **'classes'**, 'sites', **'adults'**, 'team', 'well', 'instruction', **'building'**, 'activities', 'last', 
$'disabilities'$, 'help', 'exploration', **'noncredit'**, 'progress', 'market', 'project', 'original', 'basic', 'groups', 'consist',

_Cluster 0 institutions:_ South Bay (El Camino), South Bay (Southwestern), Butte-Glenn, Mid Alameda County (Chabot-Las Positas), Southeast Los Angeles, College of the Canyons, San Francisco, Monterey, Southern Alameda County (Ohlone), Salinas Valley, South Bay (San Jose Evergreen), Morongo Basin,

_Cluster 1 words:_ 'aep', 'three-year', 'fiscally', 'board', 'ensure', 'engage', 'program', 'county', b"'s", 'across', 'approved', **'academic'**, 'gaps', 'managed', 'comprehensive', **'noncredit'**, **'adults'**, 'make', 'approaches', 'ongoing',

_Cluster 1 institutions:_ San Bernardino, Victor Valley, South Orange, Ventura County, Feather River, Contra Costa, Santa Monica, Barstow,

_Cluster 2 words:_ 'county', 'training', **'classes'**, 'skills', '3-year', 'learners', 'high', 'certificates', 'well', **'local'**, 'reviewed', 'within', $'diploma'$, **'adults'**, 'agencies', 'instruction', 'serves', 'market', 'current', 'learning',

_Cluster 2 institutions:_ Gavilan, Gateway (Merced), Sierra Joint, Imperial, Los Angeles, Mendocino-Lake, Rancho Santiago, North Coast, ACCEL (San Mateo), Marin, Northern Alameda County (Peralta), Napa Valley, Palo Verde, San Diego, Santa Cruz, Riverside About Students, Siskiyous, Kern, Stanislaus Mother Lode (Yosemite), Lassen, West Hills, West Kern, Sonoma, Capital (Los Rios), Pasadena, Shasta-Tehama-Trinity,

_Cluster 3 words:_ 'saec', 'objectives', 'committee', **'academic'**, 'access', 'strategic', **'building'**, 'retreat', 'program', 'staff', $'counseling'$, 'support', 'seamless', 'current', 'within', 'jo', 'first', 'board', 'county', 'assist',

_Cluster 3 institutions:_ Sequoias, Solano,

_Cluster 4 words:_ '3-year', 'created', 'initiated', 'three-year', 'progress', 'made', 'leveraging', 'staff', 'stakeholders', 'outcome', 'address', 'local', 'engage', 'effectiveness', 'reviewed', 'activities', 'ensure', 'san', 'approved', 'learning',

_Cluster 4 institutions:_ Desert, North Orange, Lake Tahoe, Mt. San Antonio, Coast, San Diego East (Grossmont-Cuyamaca), Rio Hondo, Coastal North, Long Beach, Foothill De Anza, Allan Hancock, State Center, Citrus, Santa Barbara, Southwest Riverside, Delta Sierra Alliance, San Diego North (Palomar), Tri-Cities, Antelope Valley, Glendale, West End Corridor, San Luis Obispo, North Central (Yuba),


## Gensim Doc2Vec and KMeans

<img src='d2v_cluster.png'>

Cluster 0 words: 'fiscally', b"'s", 'focus', **'gaps'**, **'address'**, **'curriculum'**, **'allocate'**, **'goals'**, 'high', **'partnerships'**, 'esl', 'program', 'learners', **'basic'**, 'data', 'workforce', 'education', 'initiated', 'skills', 'community',

Cluster 0 institutions: Desert, South Bay (Southwestern), San Bernardino, Lake Tahoe, Mt. San Antonio, Victor Valley, Gateway (Merced), Mid Alameda County (Chabot-Las Positas), San Diego East (Grossmont-Cuyamaca), Rio Hondo, Los Angeles, Mendocino-Lake, Ventura County, Coastal North, Southeast Los Angeles, College of the Canyons, North Coast, Allan Hancock, State Center, ACCEL (San Mateo), Northern Alameda County (Peralta), Palo Verde, San Diego, Riverside About Students, Siskiyous, Delta Sierra Alliance, Stanislaus Mother Lode (Yosemite), Santa Monica, Lassen, Tri-Cities, Salinas Valley, Glendale, West End Corridor, Sonoma, Pasadena, Morongo Basin, Barstow,

Cluster 1 words: 'fiscally', b"'s", **'goals'**, **'address'**, **'gaps'**, **'allocate'**, **'curriculum'**, **'partnerships'**, **'basic'**, 'esl', 'focus', 'workforce', 'high', 'program', 'data', 'community', 'approved', 'education', 'leveraging', 'skills',

Cluster 1 institutions: Butte-Glenn, Coast, Solano, Long Beach, Southern Alameda County (Ohlone), Antelope Valley, San Luis Obispo, North Central (Yuba),

Cluster 2 words: 'fiscally', **'goals'**, **'allocate'**, **'address'**, **'gaps'**, **'basic'**, **'curriculum'**, **'partnerships'**, 'workforce', 'approved', 'community', b"'s", 'esl', 'data', **'leveraging'**, 'board', 'college', 'education', 'program', 'areas',

Cluster 2 institutions: South Bay (El Camino), Gavilan, Sierra Joint, Imperial, Rancho Santiago, Marin, Napa Valley, San Francisco, Santa Cruz, Kern, Monterey, West Hills, West Kern, Capital (Los Rios), Shasta-Tehama-Trinity,

Cluster 3 words: 'fiscally', b"'s", **'gaps'**, **'address'**, **'goals'**, **'allocate'**, **'curriculum'**, **'partnerships'**, **'basic'**, 'focus', $'esl'$, 'workforce', 'high', 'data', 'program', 'education', 'community', 'approved', 'learners', 'areas',

Cluster 3 institutions: North Orange, South Orange, Feather River, Sequoias, Foothill De Anza, Contra Costa, Citrus, Santa Barbara, Southwest Riverside, San Diego North (Palomar), South Bay (San Jose Evergreen),

## LDA Topic Modeling

With a much larger data set, specific industries and subjects start to come through. As you hover over the clusters in the visualization below, you'll see many of the same words as before, but also words like 'technology', 'renewable energy', 'engineering', 'accounting', and 'outreach'. Some of the clusters, like 20, even have a common theme: automotive/mechanical.

```python
from IPython.display import HTML
HTML(filename='lda_vis.html')
```

## Final Thoughts

I see two main possible conclusions. The first is that the plans actually are that similar. If there are specific goals associated with the funding they are reporting on (which there are), then it's likely that we will see a lot of similarities around those areas. The fund that these plans are associated with is specific to adult education and training adults to fill gaps in the workforce. That's why 'basic', 'training', and 'gaps' are all very common. To find more specific words, I could add these common words to the stop word list, or set a lower maximum document-frequency limit to filter them out.
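Both options are available in scikit-learn's `TfidfVectorizer` through its `stop_words` and `max_df` parameters. The sketch below is only an illustration: the helper `vocabulary_with_cap` and the 0.8 threshold are assumptions chosen for the example, not values taken from the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vocabulary_with_cap(documents, max_df=0.8, extra_stop_words=()):
    """Return the vocabulary left after capping document frequency.

    Terms that appear in more than `max_df` (as a fraction) of the documents
    are dropped, along with any explicitly listed stop words.
    """
    vectorizer = TfidfVectorizer(
        max_df=max_df,
        stop_words=list(extra_stop_words) or None,
    )
    vectorizer.fit(documents)
    return sorted(vectorizer.vocabulary_)
```

With three toy documents that all contain 'basic', 'training', and 'gaps', a cap of 0.8 removes those three terms and keeps only the words that distinguish one document from another.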

The other conclusion, which is not mutually exclusive with the first, is that this is just too small of a data set. I would like to find a way to train a model on other educational text that could better link words together in a multidimensional space.

As far as replacing the current way of summarizing these documents, none of these models are quite there. Although they provide a high-level overview of the themes within the documents, they do not give enough specifics to be useful for any kind of decision making.