# Final Report - Team 2 NLP Project

Ella Xu, Jerry Nolf, Matthew Luna, Nathan Sharick - Innis Cohort

---

### Project Description:

- Most code hosting platforms for open-source projects treat the README file as the project's introduction. Because it is the first document a reader sees, it needs to be crafted with care. The goal of this project is to predict the programming language of 100 repositories by scraping and analyzing the contents of each repository's README file. Using the data from these 100 READMEs, we were able to predict which programming language was used based on the composition of the README text.

### Project Goal:

- The goal of this project was to build a classification model that can predict the programming language of a repository based on the text of the repository's README.md file. 

### Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata
import re
import json
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')
import matt_prepare
import visualization
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

IndentationError: expected an indented block (visualization.py, line 15)

### Data Acquisition, Data Cleaning, and Data Preparation

- Web scraping methods were used to create a list of more than 300 GitHub username/repository pairs. The repository names were pulled from GitHub's top trending repositories, most forked repositories, and most starred repositories as of 05/13/2022.

- The list of repositories was put into the accompanying acquire.py file, which creates a list of dictionaries containing the name of each repository, the programming language used in the repository, and the content of its README file, and saves the list as a .json file. The .json file is required to reproduce this project with this notebook; it can be created by saving the acquire.py file in your local repository and running 'python acquire.py' from the terminal.

- Once the .json file is saved in the local directory, it can be pulled into the notebook, cleaned, and prepared using the clean_df function located in the matt_prepare.py file. This function performs the following cleaning/preparation actions:
    - It uses pandas to read the file into the notebook
    
    - It cleans the data by normalizing it, changing all words to lower case, and removing any characters that are not letters, numbers, or whitespace
    
    - It tokenizes the words in the readme content using a toktok tokenizer
    
    - It removes standard English stopwords from the readme content
    
    - It then outputs a dataframe with the following columns:
        - Name of the repository
        
        - Programming language of the repository
        
        - The raw readme contents
        
        - The cleaned readme content
        
        - The cleaned readme content that has been stemmed using a PorterStemmer
        
        - The cleaned readme content that has been lemmatized using a WordNetLemmatizer
        
        - The character count for the readme content
        
        - The word count for the readme content
        
- The dataset is then split into train, validate, and test sets using the split_data function located in the matt_prepare.py file, which returns a train set with 56%, a validate set with 24%, and a test set with 20% of the original dataframe.
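
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in matt_prepare.py: `basic_clean`, `remove_stopwords`, and the tiny `STOPWORDS` set are hypothetical stand-ins, whitespace splitting replaces the toktok tokenizer, and stemming and lemmatizing are left out.

```python
import re
import unicodedata

# Hypothetical stand-in for NLTK's English stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "for", "to", "and", "of"}


def basic_clean(text):
    """Normalize to ASCII, lowercase, and drop everything except letters, digits, and whitespace."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)


def remove_stopwords(text):
    """Split on whitespace and drop stopwords (whitespace splitting stands in for the tokenizer)."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


sample = '<p align="center"> The Café build is for the Web! </p>'
cleaned = remove_stopwords(basic_clean(sample))
print(cleaned)  # p aligncenter cafe build web p
```

Because punctuation is removed without inserting spaces, HTML such as `align="center"` collapses into a single token like `aligncenter`, and a bare tag name such as `p` survives as its own token.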


In [None]:
#pull in, clean, and prepare the dataset using the clean_df function
df = matt_prepare.clean_df()
#view the first five rows of the returned dataframe
df.head()

In [None]:
#split the dataset into train, validate, and test using the split_data function
train, validate, test = matt_prepare.split_data(df)
#view the row and column counts of the split dataframes
train.shape, validate.shape, test.shape
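
The 56/24/20 proportions are consistent with two successive splits: 20% held out for test, then 30% of the remaining 80% held out for validate (0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24). The sketch below assumes that is how split_data works; the real implementation lives in matt_prepare.py and may differ, for example by stratifying on language. The 100 placeholder rows and the random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# 100 placeholder rows stand in for the prepared dataframe (an assumption for this sketch).
rows = list(range(100))

# First split: hold out 20% of the rows as the test set.
train_validate, test_rows = train_test_split(rows, test_size=0.2, random_state=123)

# Second split: hold out 30% of the remaining 80% as the validate set.
train_rows, validate_rows = train_test_split(train_validate, test_size=0.3, random_state=123)

print(len(train_rows), len(validate_rows), len(test_rows))  # 56 24 20
```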

---

### Data Exploration

- Data exploration comments.....

**Question 1:** What are the most common languages from the repos we explored?

In [None]:
visualization.top_languages(train)

***Takeaways: JavaScript and Python are the two most frequently used languages.***

**Question 2:** What are the most common words across the READMEs of the top languages?

In [None]:
visualization.word_cloud(train)

***Takeaways: Some words are popular across the READMEs of the top languages, such as aligncenter, application, build, web, p....***

**Question 3:** Is there a relationship between character count and word count?

In [None]:
visualization.char_word(train)

#### Statistical Testing (Correlation):


#### Hypothesis:

 - H0: There is no linear relationship between the character count and the word count of a README.

 - Ha: There is a linear relationship between the character count and the word count of a README.

A Pearson's r statistical test lets us test these hypotheses.

In [None]:
visualization.question3_stats(train)
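
For readers without visualization.py, a Pearson's r test of this kind can be sketched with `scipy.stats.pearsonr`. The counts below are made-up illustrative values, not the project's data, and the 0.05 significance level is an assumption; the actual question3_stats function may be implemented differently.

```python
from scipy.stats import pearsonr

# Made-up README sizes for illustration only (not the project's data).
word_counts = [80, 120, 340, 450, 560, 910]
char_counts = [500, 800, 2100, 2950, 3600, 5900]

alpha = 0.05  # assumed significance level

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p = pearsonr(word_counts, char_counts)

# Reject H0 (no linear relationship) when the p-value falls below alpha.
reject_null = p < alpha
print(f"r = {r:.3f}, p = {p:.4f}, reject H0: {reject_null}")
```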

***Takeaways: There is a positive relationship between character count and word count.***

**Question 4:** Does the length of the README differ between languages?

In [None]:
visualization.char_count(train)

In [None]:
visualization.question4_stats(train)

***Takeaways: On average, JavaScript has the highest character count of all the languages, followed by TypeScript.***

In [None]:
visualization.word_count(train)

***Takeaways: JavaScript has the highest word count of all the languages, followed by other and then Python. Comparing word and character counts, Python has a higher word count, but TypeScript has a higher character count.***

### Takeaways from Data Exploration

- Takeaway notes......

---

### Data Preparation for Modeling

- Before proceeding with modeling, the data from each of the split groups was assigned to variables that can be used with the vectorizers and the classification models during the modeling phase

- The cleaned and lemmatized readme contents from each of the split groups were assigned to an X variable for the group

- The engineered language classification column (all language identifiers were either one of the top five from the dataset or assigned as 'other') from each split dataset was assigned to a y variable for the group
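
One way the engineered column could be built with pandas is sketched below. The `language` column name and the toy counts are assumptions made for illustration; `top_five_languages` is the column name used in this notebook, but the actual logic in matt_prepare.py may differ.

```python
import pandas as pd

# Toy data with seven languages; the "language" column name is an assumption.
langs = (["JavaScript"] * 6 + ["Python"] * 5 + ["TypeScript"] * 4
         + ["Java"] * 3 + ["Go"] * 2 + ["Rust", "C"])
df = pd.DataFrame({"language": langs})

# Find the five most frequent languages.
top_five = df["language"].value_counts().head(5).index

# Keep the language when it is in the top five, otherwise label it 'other'.
df["top_five_languages"] = df["language"].where(df["language"].isin(top_five), "other")

print(df["top_five_languages"].value_counts())
```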

In [None]:
#assign the cleaned and lemmatized readme contents from the train set to the X_train variable
X_train = train.lemmatized
#assign the engineered language classification value from the train set to the y_train variable
y_train = train.top_five_languages
#assign the cleaned and lemmatized readme contents from the validate set to the X_val variable
X_val = validate.lemmatized
#assign the engineered language classification value from the validate set to the y_val variable
y_val = validate.top_five_languages
#assign the cleaned and lemmatized readme contents from the test set to the X_test variable
X_test = test.lemmatized
#assign the engineered language classification value from the test set to the y_test variable
y_test = test.top_five_languages

---

### Modeling

- Two types of vectorizers (CountVectorizer and TfidfVectorizer) were used to preprocess the data prior to running the classification modeling

- Over 100 models were developed using both sets of vectorized data. Multiple classification model types (decision tree, logistic regression, random forest, etc.) and a full range of hyperparameters for each model type were evaluated.

- The top eight models with their optimal hyperparameters and vectorized datasets are seen below

- Each model's
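
A hyperparameter sweep of the kind described above can be sketched as a loop that fits one model per candidate value and records train and validate accuracy. The toy corpus, the labels, and the list of `C` values are illustrative assumptions; this is not the search code the team ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy README-like snippets for illustration only (not the project's data).
train_docs = [
    "npm install react component", "node package json script",
    "webpack bundle javascript module", "yarn add express server",
    "pip install numpy array", "python virtualenv requirements txt",
    "pandas dataframe notebook jupyter", "conda environment python package",
]
train_labels = ["JavaScript"] * 4 + ["Python"] * 4
val_docs = ["npm run build script", "pip install pandas"]
val_labels = ["JavaScript", "Python"]

# Fit the vectorizer on train only, then reuse it for validate.
tfidf = TfidfVectorizer()
X_tr = tfidf.fit_transform(train_docs)
X_va = tfidf.transform(val_docs)

# Fit one logistic regression per candidate C and record both accuracies.
results = []
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c, random_state=123).fit(X_tr, train_labels)
    results.append({
        "C": c,
        "train_acc": model.score(X_tr, train_labels),
        "val_acc": model.score(X_va, val_labels),
    })

# Pick the candidate with the highest validate accuracy.
best = max(results, key=lambda row: row["val_acc"])
print(best)
```

Comparing train and validate accuracy side by side for each candidate helps spot overfit models, which score well on train but poorly on validate.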

In [None]:
##modeling with unigrams## 

#create the CountVectorizer object
cv = CountVectorizer()
#fit the CountVectorizer with the train data and transform the train data
X_bow = cv.fit_transform(X_train)
#transform the validate data with the CountVectorizer
X_bow_val = cv.transform(X_val)

#create the TfidfVectorizer
tfidf = TfidfVectorizer()
#fit the TfidfVectorizer with the train data and transform the train data
X_tfidf = tfidf.fit_transform(X_train)
#transform the validate data
X_tfidf_val = tfidf.transform(X_val)
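
The difference between the two vectorizers can be seen on a three-document toy corpus (the documents are made up for illustration). CountVectorizer stores raw term counts, while TfidfVectorizer down-weights terms that appear in many documents, so a word shared by every document ends up with a lower weight than a word unique to one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration; "install" appears in all three.
docs = ["npm install react", "pip install numpy", "pip install pandas"]

# Bag of words: one column per distinct word, values are raw counts.
cv = CountVectorizer()
bow = cv.fit_transform(docs)

# TF-IDF: same columns, but values are scaled by how rare each word is.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

install_col = tfidf.vocabulary_["install"]
npm_col = tfidf.vocabulary_["npm"]

print(bow.shape)  # (3, 6)
# In the first document, the common word gets less weight than the rare one.
print(weights[0, install_col] < weights[0, npm_col])  # True
```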

In [None]:
#create the decision tree classifier object with a max depth of 15 and a random state of 123 for reproducibility
tree = DecisionTreeClassifier(max_depth=15, random_state=123)
#fit the decision tree with the CountVectorizer train data
tree.fit(X_bow, y_train)
#calculate the accuracy score of the decision tree with countvectorizer train and validate data
tree.score(X_bow, y_train), tree.score(X_bow_val, y_val)

In [None]:
#create the decision tree classifier object with a max depth of 15 and a random state of 123 for reproducibility
tree = DecisionTreeClassifier(max_depth=15, random_state=123)
#fit the decision tree with the tfidfvectorizer train data
tree.fit(X_tfidf, y_train)
#calculate the accuracy score of the decision tree with tfidfvectorizer train and validate data
tree.score(X_tfidf, y_train), tree.score(X_tfidf_val, y_val)

In [None]:
#create the logistic regression classifier object with a C value of 0.05 and a random state of 123 for reproducibility
#and fit it with the countvectorizer train data
lm = LogisticRegression(C=0.05, random_state=123).fit(X_bow, y_train)
#calculate the accuracy score of the model with countvectorizer train and validate data
lm.score(X_bow, y_train), lm.score(X_bow_val, y_val)

In [None]:
#create the logistic regression classifier object with a C value of 10 and a random state of 123 for reproducibility
#and fit it with the tfidfvectorizer train data
lm = LogisticRegression(C=10, random_state=123).fit(X_tfidf, y_train)
#calculate the accuracy score of the model with tfidfvectorizer train and validate data
lm.score(X_tfidf, y_train), lm.score(X_tfidf_val, y_val)

---

In [None]:
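##modeling with unigrams and bigrams##

#create the CountVectorizer object that counts both unigrams and bigrams
cv = CountVectorizer(ngram_range=(1,2))
#fit the CountVectorizer with the train data and transform the train data
X_bow = cv.fit_transform(X_train)
#transform the validate data with the CountVectorizer
X_bow_val = cv.transform(X_val)

#create the TfidfVectorizer object that weights both unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1,2))
#fit the TfidfVectorizer with the train data and transform the train data
X_tfidf = tfidf.fit_transform(X_train)
#transform the validate data
X_tfidf_val = tfidf.transform(X_val)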
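
To make the effect of `ngram_range=(1,2)` concrete, the sketch below vectorizes a single made-up sentence and lists the resulting features: every single word plus every pair of adjacent words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up document for illustration.
docs = ["npm install react"]

# ngram_range=(1, 2) keeps unigrams and adds bigrams of adjacent words.
cv_bigram = CountVectorizer(ngram_range=(1, 2))
cv_bigram.fit(docs)

features = sorted(cv_bigram.vocabulary_)
print(features)
# ['install', 'install react', 'npm', 'npm install', 'react']
```

Adding bigrams grows the feature space considerably on real READMEs, which is one reason the best hyperparameters for these models differ from the unigram-only models.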

In [None]:
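#create the decision tree classifier object with a max depth of 20 and a random state of 123 for reproducibility
tree = DecisionTreeClassifier(max_depth=20, random_state=123)
#fit the decision tree with the CountVectorizer (unigram and bigram) train data
tree.fit(X_bow, y_train)
#calculate the accuracy score of the decision tree with countvectorizer train and validate data
tree.score(X_bow, y_train), tree.score(X_bow_val, y_val)

In [None]:
#create the decision tree classifier object with a max depth of 10 and a random state of 123 for reproducibility
tree = DecisionTreeClassifier(max_depth=10, random_state=123)
#fit the decision tree with the tfidfvectorizer (unigram and bigram) train data
tree.fit(X_tfidf, y_train)
#calculate the accuracy score of the decision tree with tfidfvectorizer train and validate data
tree.score(X_tfidf, y_train), tree.score(X_tfidf_val, y_val)

In [None]:
#create the logistic regression classifier object with a C value of 0.1 and fit it with the countvectorizer train data
lm = LogisticRegression(C=0.1, random_state=123).fit(X_bow, y_train)
#calculate the accuracy score of the model with countvectorizer train and validate data
lm.score(X_bow, y_train), lm.score(X_bow_val, y_val)

In [None]:
#create the logistic regression classifier object with a C value of 500 and fit it with the tfidfvectorizer train data
lm = LogisticRegression(C=500, random_state=123).fit(X_tfidf, y_train)
#calculate the accuracy score of the model with tfidfvectorizer train and validate data
lm.score(X_tfidf, y_train), lm.score(X_tfidf_val, y_val)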

---

**Best Model with the Test Dataset**

In [None]:
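#create the unigram TfidfVectorizer, fit it with the train data, and transform the train data
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(X_train)
#transform the validate and test data with the fitted TfidfVectorizer
X_tfidf_val = tfidf.transform(X_val)
X_tfidf_test = tfidf.transform(X_test)
#create the logistic regression classifier object with a C value of 10 and fit it with the tfidfvectorizer train data
lm = LogisticRegression(C=10, random_state=123).fit(X_tfidf, y_train)
#calculate the accuracy score of the model with train, validate, and test data
lm.score(X_tfidf, y_train), lm.score(X_tfidf_val, y_val), lm.score(X_tfidf_test, y_test)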
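
The notebook imports `classification_report` and `confusion_matrix` but reports only accuracy. A per-language breakdown could look like the sketch below, which uses made-up labels and predictions rather than the project's results; with the real model, `y_true` would be `y_test` and `y_pred` would come from `lm.predict(X_tfidf_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions for illustration only (not the project's results).
y_true = ["Python", "Python", "JavaScript", "JavaScript", "other", "other"]
y_pred = ["Python", "JavaScript", "JavaScript", "JavaScript", "other", "Python"]
labels = ["JavaScript", "Python", "other"]

# Rows are actual languages, columns are predicted languages, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 0 0]
#  [1 1 0]
#  [0 1 1]]

# Precision, recall, and f1-score for each language.
print(classification_report(y_true, y_pred, labels=labels))
```

Per-class metrics matter here because the classes are unlikely to be balanced: a model can reach a reasonable accuracy by predicting the most common language while performing poorly on the rarer ones.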

---

### Results from Modeling

- Notes on modeling results .....

---

### Summary

- Summary notes .....

### Next Steps

- Notes for next steps ....