# CIS09 Intro to Data Science Final Project

*Names: Maria Gorbunova, Yueqi Wang*

# 1. Project Description

Using 60,000 Stack Overflow questions collected from 2016 to 2020, build models that classify the quality of each question into one of three categories:

    1.	HQ: High-quality posts with a total score of 30 or more and without a single edit.
    2.	LQ_EDIT: Low-quality posts with a negative score and multiple community edits that nevertheless remain open after those changes.
    3.	LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.

Study the characteristics of good questions on Stack Overflow and pick the features that have the strongest correlation with the categories. Train several models to predict the category of the post. Choose the best model.


# 2. Import main packages and data

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

import nltk
from nltk.tokenize import RegexpTokenizer

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv("../input/60k-stack-overflow-questions-with-quality-rate/train.csv")
valid = pd.read_csv("../input/60k-stack-overflow-questions-with-quality-rate/valid.csv")
train.head()

# 3. Preprocess data and generate additional features

a.	Get number of words in body

In [None]:
# change column names to lower:
train.columns = train.columns.str.lower()

# remove the opening <p> tags from the body text (literal match, not a regex)
train.body = train.body.str.replace('<p>', '', regex=False)
# count words by splitting on any whitespace; split(' ') would also count
# the empty strings between repeated spaces
train["num_words_in_body"] = train.body.str.split().str.len()
# show the data
train["num_words_in_body"].head()
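Splitting on a single space and splitting on any whitespace can give different word counts. The snippet below is a small illustration on a made-up string (not taken from the dataset):

```python
# Hypothetical question text with a double space and a newline
text = "How  do I\nmerge two  DataFrames?"

# split(' ') breaks only on single spaces, so repeated spaces produce
# empty strings and the newline does not separate words
by_space = text.split(' ')

# split() with no argument breaks on any run of whitespace
by_whitespace = text.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

Here `split(' ')` reports 7 pieces while `split()` reports 6 words.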

b.	Get number of words in title

In [None]:
# count words by splitting on any whitespace
train["num_words_in_title"] = train.title.str.split().str.len()
train["num_words_in_title"].head()

c.	Get number of tags per question

In [None]:
# count the tags in each row; [^<>]+ also matches tags such as c++, c# or python-3.x,
# which \w+ would skip
num_of_tags = pd.Series([len(re.findall(r'<([^<>]+)>', row)) for row in train.tags])
# add the count to train DataFrame
train["num_of_tags"] = num_of_tags
train["num_of_tags"].head()
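The choice of regular expression matters when counting tags, because many Stack Overflow tags contain characters other than letters, digits and underscores. A minimal illustration, using a made-up tag string in the dataset's `<tag1><tag2>` format:

```python
import re

# Hypothetical sample in the same format as the tags column
sample_tags = "<c++><python-3.x><pandas>"

# \w+ matches only letters, digits and underscores, so tags that
# contain '+', '-' or '.' are not matched at all
word_only = re.findall(r"<(\w+)>", sample_tags)

# [^<>]+ matches everything between a pair of angle brackets
any_chars = re.findall(r"<([^<>]+)>", sample_tags)

print(word_only)   # ['pandas']
print(any_chars)   # ['c++', 'python-3.x', 'pandas']
```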

d.	Extract tags for coding language associated with the question

In [None]:
# get a list of tags from the dataFrame Tags column
tags_list = []
for row in train.tags:
    # version-specific tags such as python-2.7 and python-3.x are counted as separate tags
    for tag in re.findall(r'<(.*?)>', row):
        tags_list.append(tag)
        
# store unique tags into a tags_set
tags_set = set(tags_list)

# use the nltk package to count the frequency of each tag
tags_freqD = nltk.FreqDist(tags_list)

# sort the dictionary by the count, in descending order
sorted_tagsD = sorted(tags_freqD.items(), key=lambda item:item[1], reverse=True)

# sanity check: inspect the counts of all tags containing "python"
# for k, v in sorted_tagsD:
#     if "python" in k:
#         print(k, v)

In [None]:
# top10_tags are the most used tags in the dataset
top10_tags = [tag[0] for tag in sorted_tagsD[:10]]
top10_tags

e.	Extract year of the question

In [None]:
# extract year from CreationDate
CreationYear = pd.Series([date[:4] for date in train.creationdate])
# add CreationYear to train DataFrame
train["creation_year"] = CreationYear
train["creation_year"].head()

f. Replace < and > in the tags column with spaces

In [None]:
# '<|>' is a regex alternation (< or >); regex=True is required because
# recent pandas versions treat the pattern as a literal string by default
train.tags = train.tags.str.replace('<|>', ' ', regex=True)
train.tags.head()


g. Print the layout of the train DataFrame after preprocessing

In [None]:
train.head()
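Since `valid.csv` is loaded alongside `train.csv`, the same features would need to be generated for it before any model is evaluated. One possible way to do that is to collect the steps above into a single function. The sketch below is an assumption about how this could look: `add_features` is a hypothetical helper, the demo row is made up, and the column names (`Title`, `Body`, `Tags`, `CreationDate`, `Y`) are assumed to match the CSV files.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered feature columns added."""
    out = df.copy()
    out.columns = out.columns.str.lower()
    # strip opening <p> tags, then count whitespace-separated words
    body = out["body"].str.replace("<p>", "", regex=False)
    out["num_words_in_body"] = body.str.split().str.len()
    out["num_words_in_title"] = out["title"].str.split().str.len()
    # count <...> groups in the raw tags string
    out["num_of_tags"] = out["tags"].str.count(r"<[^<>]+>")
    # the first four characters of the creation date are the year
    out["creation_year"] = out["creationdate"].str[:4]
    return out

# Tiny made-up frame standing in for train.csv / valid.csv
demo = pd.DataFrame({
    "Title": ["How to merge two DataFrames?"],
    "Body": ["<p>I have two frames and want to join them.</p>"],
    "Tags": ["<python><pandas>"],
    "CreationDate": ["2019-05-01 10:00:00"],
    "Y": ["HQ"],
})
features = add_features(demo)
print(features[["num_words_in_body", "num_words_in_title",
                "num_of_tags", "creation_year"]])
```

The function deliberately leaves the `tags` column itself unchanged, so it would have to be applied to the raw data, before the angle brackets are replaced with spaces.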

# 4. Data Visualization 

a.	Visualize the data by “y”, the question quality label, to verify that the data is balanced across the quality categories

In [None]:
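labels = train.y.unique()
# count questions per label; take the x values from the same result so that
# each bar is paired with its own label (unique() and groupby() order differently)
label_counts = train.groupby('y').size()
plt.bar(label_counts.index, label_counts.values)
plt.xlabel("Quality Labels")
plt.ylabel("Total questions in Train dataset")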

From the bar chart above, we can see that the train data is well balanced, with an equal number of questions in each quality category.

b.	Plot total questions per year, then total questions by label and year

In [None]:
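# count questions per year; groupby sorts the years, so take the x values
# from the same result instead of from unique(), which keeps order of appearance
year_counts = train.groupby('creation_year').size()
years = year_counts.index.tolist()
plt.bar(years, year_counts.values)
plt.xlabel("Year")
plt.ylabel("Total questions in year")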

As shown above, the number of questions in the dataset decreases compared to previous years.
It would be interesting to check the quality of the questions in each year of the train dataset.

In [None]:
# rows: creation year, columns: quality label, values: number of questions
year_label_df = pd.crosstab(train.creation_year, train.y)
print(year_label_df)
year_label_df.plot.barh()

- There are fewer HQ (high-quality) questions as the total number of questions goes down.
- Meanwhile, the number of LQ (low-quality) questions does not drop as dramatically as the number of HQ questions.
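Because the yearly totals differ, comparing shares rather than raw counts may make this trend easier to read. A minimal sketch with `pd.crosstab`, run on a made-up stand-in for the train DataFrame (the rows below are illustrative, not real data):

```python
import pandas as pd

# Made-up stand-in with the same two columns as the train DataFrame
toy = pd.DataFrame({
    "creation_year": ["2016", "2016", "2016", "2016", "2020", "2020"],
    "y": ["HQ", "HQ", "LQ_EDIT", "LQ_CLOSE", "LQ_EDIT", "LQ_CLOSE"],
})

# raw counts per year and label
counts = pd.crosstab(toy["creation_year"], toy["y"])

# normalize="index" divides each row by its total, giving the share
# of each label within a year
shares = pd.crosstab(toy["creation_year"], toy["y"], normalize="index")

print(counts)
print(shares.round(2))
```

Calling `shares.plot.barh(stacked=True)` on the real data would show the label mix per year on a common 0 to 1 scale.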

c.	Plot frequency of top 10 used tags in Stack Overflow

In [None]:
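# bar chart of the 10 most frequent tags (FreqDist.plot would draw a line plot)
top_tags, top_counts = zip(*tags_freqD.most_common(10))
plt.bar(top_tags, top_counts)
plt.xticks(rotation=45)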

d.	Plot number of HQ, LQ posts for top 10 tags

In [None]:
# split each tags string into a list of exact tag names; a substring or regex
# match would count "c" inside "c++" or "css", and "c++" is not a valid regex
tag_lists = train.tags.str.findall(r'[^<>\s]+')

# rows: top 10 tags, columns: quality labels, values: number of questions
tag_label_df = pd.DataFrame(
    columns=labels, index=top10_tags,
    data=[[((train.y == label) & tag_lists.apply(lambda tags: tag in tags)).sum()
           for label in labels] for tag in top10_tags])
print(tag_label_df)
tag_label_df.plot.bar()

e.	Plot correlation between features and quality label
    -	length of title vs quality
    -	length of body text vs quality
    -	number of tags vs quality
    -	year vs quality

In [None]:
sns.pairplot(data=train, y_vars=["y"], x_vars=["num_words_in_title", "num_words_in_body", "num_of_tags", "creation_year"])
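Since the label is categorical, a per-label summary table can complement the plot when judging how the numeric features relate to quality. The sketch below uses a made-up DataFrame with the same column names as the engineered features. The values are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up values; only the column names mirror the notebook
toy = pd.DataFrame({
    "y": ["HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_EDIT", "LQ_EDIT"],
    "num_words_in_title": [9, 11, 5, 7, 8, 6],
    "num_words_in_body": [120, 80, 30, 50, 200, 100],
    "num_of_tags": [3, 4, 1, 2, 3, 2],
})
feature_cols = ["num_words_in_title", "num_words_in_body", "num_of_tags"]

# mean and median of each feature within each quality label
summary = toy.groupby("y")[feature_cols].agg(["mean", "median"])
print(summary)
```

Large differences between the rows of such a table would suggest that a feature helps separate the categories.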

# 5. Build Classification Models With Features

a.	DecisionTreeClassifier

b.	KNN classifier

c.	naïve_bayes: GaussianNB
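The three classifiers listed above could be trained on the numeric features along the following lines. This is a sketch under stated assumptions, not the project's final implementation: the data is randomly generated (so the accuracies are meaningless), the hyperparameters are arbitrary, and the hold-out split stands in for the separate `valid.csv` file.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Randomly generated stand-in for the engineered features and labels
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "num_words_in_title": rng.integers(3, 20, n),
    "num_words_in_body": rng.integers(10, 400, n),
    "num_of_tags": rng.integers(1, 6, n),
    "y": rng.choice(["HQ", "LQ_EDIT", "LQ_CLOSE"], n),
})
feature_cols = ["num_words_in_title", "num_words_in_body", "num_of_tags"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy["y"],
    test_size=0.25, random_state=0, stratify=toy["y"])

# hyperparameters are arbitrary starting points
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "gaussian_nb": GaussianNB(),
}

# fit each model and record its accuracy on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

KNN is distance-based, so on the real data it may benefit from scaling the features first, since body length has a much larger range than the tag count.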

# 6. Build NLP Model

a.	Tokenize the words (combine tag and text )

b.	Remove stop words

c.	Get the frequency distribution (possibly limited to certain years due to capacity limits) and find out:
    - What are the commonly used words in high-rated posts?
    - What are the commonly used words in low-rated posts?

d.	Stemming words

e.	Transform words into document vector by CountVectorizer

f.	Use MultinomialNB to create the model
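Steps a to f could be combined roughly as follows. This is a minimal sketch on four made-up documents: `tokenize_and_stem` is a hypothetical helper, scikit-learn's built-in English stop-word list is swapped in for an NLTK one (to avoid a corpus download), and the predicted label on such a tiny sample carries no meaning.

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()

def tokenize_and_stem(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenizer.tokenize(text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

# Made-up documents (title, body and tags would be concatenated in practice)
toy_docs = [
    "How do I merge two pandas DataFrames on multiple columns?",
    "What is the difference between a list and a tuple in Python?",
    "plz help my code not working",
    "urgent homework help needed code not working",
]
toy_labels = ["HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE"]

# step c: token frequencies per label
freq_by_label = {}
for text, label in zip(toy_docs, toy_labels):
    freq_by_label.setdefault(label, Counter()).update(tokenize_and_stem(text))
print({label: freq.most_common(3) for label, freq in freq_by_label.items()})

# steps e and f: bag-of-words counts fed into multinomial naive Bayes
model = make_pipeline(
    CountVectorizer(tokenizer=tokenize_and_stem, lowercase=False, token_pattern=None),
    MultinomialNB(),
)
model.fit(toy_docs, toy_labels)
prediction = model.predict(["my code is not working, please help"])
print(prediction)
```

Passing the tokenizer to `CountVectorizer` keeps preprocessing and model in one pipeline, so the same steps are applied automatically at prediction time.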

# 7. Models Evaluation and Final Conclusion

a.	Which model is better for the prediction? Why do we think it performs better than the other one?

b.	What are the common characteristics of “good” questions?

c.	Room for improvement?
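For question (a), accuracy and a confusion matrix computed on held-out data are one way to compare the models. The sketch below uses made-up true and predicted labels purely to show the mechanics. In the notebook, these would come from the validation set and from each fitted model.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels standing in for validation data and model predictions
y_true = ["HQ", "HQ", "LQ_EDIT", "LQ_EDIT", "LQ_CLOSE", "LQ_CLOSE"]
y_pred = ["HQ", "LQ_CLOSE", "LQ_EDIT", "LQ_EDIT", "LQ_CLOSE", "HQ"]

# fix the label order so rows and columns are easy to read
label_order = ["HQ", "LQ_EDIT", "LQ_CLOSE"]
cm = confusion_matrix(y_true, y_pred, labels=label_order)

# rows are true labels, columns are predicted labels
cm_df = pd.DataFrame(cm,
                     index=[f"true_{l}" for l in label_order],
                     columns=[f"pred_{l}" for l in label_order])
accuracy = accuracy_score(y_true, y_pred)

print(cm_df)
print(round(accuracy, 3))
```

The off-diagonal cells show which categories a model confuses, which can help explain why one model outperforms another.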