<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP

## Problem Statement

This project compares human-written and AI-generated answers by using NLP to train a classifier that predicts whether a response came from a human or from ChatGPT.

Questions and human answers were collected from Reddit with PRAW; AI answers were collected from ChatGPT.

### Contents:
- [Background](#Background)
- [Data Cleaning and EDA](#Data-Cleaning-and-EDA)
- [Preprocessing and Modeling](#Preprocessing-and-Modeling)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

### Dataset

* [`project3_answer.csv`](../data/project3_answer.csv): human and AI answers

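**Brief description of the contents for each dataset.**

`project3_answer.csv` holds 10,924 rows with no missing values. Once the leftover index column is dropped, two columns remain:

* `answer` (object): the response text
* `result` (int64): binary class label (0 or 1), split evenly between the two classes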



## Data Cleaning and EDA

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [2]:
df = pd.read_csv('../data/project3_answer.csv')

In [3]:
df.head()

|   | Unnamed: 0 | answer | result |
|---|---|---|---|
| 0 | 0 | It's all I have | 1 |
| 1 | 1 | 3 months minimum and I'd watch it\n\nThanks fo... | 1 |
| 2 | 2 | “I recognize the council has made a decision, ... | 1 |
| 3 | 3 | what about subs that crosspost from other subs... | 1 |
| 4 | 4 | Found this after accidentally losing my place ... | 1 |


In [4]:
df.drop(columns='Unnamed: 0', inplace=True)
df.head()

|   | answer | result |
|---|---|---|
| 0 | It's all I have | 1 |
| 1 | 3 months minimum and I'd watch it\n\nThanks fo... | 1 |
| 2 | “I recognize the council has made a decision, ... | 1 |
| 3 | what about subs that crosspost from other subs... | 1 |
| 4 | Found this after accidentally losing my place ... | 1 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10924 entries, 0 to 10923
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   answer  10924 non-null  object
 1   result  10924 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 170.8+ KB


In [6]:
df.isnull().sum()

answer    0
result    0
dtype: int64

In [7]:
df['result'].value_counts(normalize=True)

1    0.5
0    0.5
Name: result, dtype: float64
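
The section title mentions EDA, but only the label balance is checked above. One quick extra check is response length per class. The sketch below is illustrative only: it uses a hypothetical four-row frame with the same `answer` and `result` columns (three of the four texts are made up), and `word_count` and `length_by_label` are names introduced here.

```python
import pandas as pd

# Hypothetical miniature frame mirroring the columns of project3_answer.csv.
# Only the first text appears in the real data; the rest are invented.
toy = pd.DataFrame({
    'answer': [
        "It's all I have",
        "It is important to note that results can vary from one person to another.",
        "lol same",
        "There are several factors to consider when choosing a laptop.",
    ],
    'result': [1, 0, 1, 0],
})

# Whitespace-delimited word count per response.
toy['word_count'] = toy['answer'].str.split().str.len()

# Summary statistics of length within each class.
length_by_label = toy.groupby('result')['word_count'].agg(['mean', 'median', 'max'])
print(length_by_label)
```

On the real data, the same two lines applied to `df` would show whether one class tends to produce longer answers, which may explain part of what the classifiers pick up on.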

## Preprocessing and Modeling

In [8]:
X = df['answer']
y = df['result']

In [9]:
# Stratify on y so both splits keep the 50/50 class balance (default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [11]:
y_train.value_counts(normalize=True)

1    0.500061
0    0.499939
Name: result, dtype: float64
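
Because the training labels are split almost exactly 50/50, a model that always predicts a single class would score about 0.5 accuracy. That is the baseline the classifiers below need to beat. A minimal sketch of that baseline, assuming a synthetic balanced label vector (`y_toy`, `X_toy` are stand-ins, not the project data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the 'result' column: perfectly balanced binary labels.
y_toy = np.array([0, 1] * 50)
# DummyClassifier ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```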

In [22]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('bnb', BernoulliNB())
])

In [10]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # fetches the corpus if it is not installed yet
nltk_stop = stopwords.words('english')

In [11]:
pipe_params = {
    'cvec__max_features' : [2500, 5000],
    'cvec__min_df' : [3, 5],
    'cvec__max_df' : [0.9, 0.95],
    'cvec__ngram_range' : [(1,1), (1, 2)],
    'cvec__stop_words' : ['english', None, nltk_stop]
}

In [25]:
gs = GridSearchCV(pipe, pipe_params, cv=5)

In [26]:
gs.fit(X_train,y_train)

In [27]:
gs.best_score_

0.8828266439004693

In [28]:
gs.score(X_train,y_train)

0.8958867325765898

In [29]:
gs.score(X_test,y_test)

0.8897839619187111
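
Accuracy alone does not show which class the errors fall into. A confusion matrix splits the test predictions into the four outcome types. The sketch below uses short made-up label vectors (`y_true_toy`, `y_pred_toy`); with the fitted search object, `gs.predict(X_test)` would supply the predictions instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (not project results).
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```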

In [30]:
gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': None}
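
`best_params_` reports only the winning combination. To see how close the other combinations came, the full cross-validation table in `cv_results_` can be loaded into a DataFrame. This is a sketch on a tiny hypothetical corpus (`toy_docs`, `toy_labels`, and the reduced grid are assumptions made so the example runs quickly); the same last step would apply to `gs.cv_results_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical corpus: four short documents per class.
toy_docs = [
    "short casual reply", "thanks that helped a lot",
    "no idea honestly", "works for me",
    "overall it is important to consider several factors",
    "in summary there are several key points to consider",
    "it is important to note several considerations",
    "there are several important factors to note",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('cvec', CountVectorizer()), ('bnb', BernoulliNB())])
toy_params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__stop_words': [None, 'english'],
}

# cv=2 only because the toy corpus is tiny.
toy_gs = GridSearchCV(toy_pipe, toy_params, cv=2)
toy_gs.fit(toy_docs, toy_labels)

# One row per parameter combination, best-ranked first.
results = (
    pd.DataFrame(toy_gs.cv_results_)
    .sort_values('rank_test_score')
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
)
print(results)
```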

In [None]:
# LogisticRegression 

In [14]:
pipe_2 = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegression(max_iter=1000))
])

In [15]:
gs_2 = GridSearchCV(pipe_2, pipe_params, cv=5)
gs_2.fit(X_train,y_train)

In [16]:
gs_2.score(X_train,y_train)

0.9827901867447821

In [17]:
gs_2.score(X_test,y_test)

0.9128524350054925

In [18]:
gs_2.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 5,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': None}

In [22]:
pipe_3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('ss', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier())
])
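
The `with_mean=False` argument is needed because `CountVectorizer` returns a sparse matrix, and centering would turn every zero into a non-zero value. A small sketch of that behavior on a hand-built sparse matrix (`X_sparse` is a stand-in for the vectorizer output):

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Stand-in for a document-term matrix: mostly zeros, stored sparsely.
X_sparse = sparse.csr_matrix([[1.0, 0.0, 2.0],
                              [0.0, 0.0, 3.0],
                              [4.0, 0.0, 0.0]])

# The default scaler tries to subtract the mean, which sparse input rejects.
try:
    StandardScaler().fit(X_sparse)
    centering_allowed = True
except ValueError:
    centering_allowed = False

# Scaling by standard deviation only keeps the matrix sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
still_sparse = sparse.issparse(X_scaled)
print(centering_allowed, still_sparse)
```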

In [23]:
gs_3 = GridSearchCV(pipe_3, pipe_params, cv=5)
gs_3.fit(X_train,y_train)

In [24]:
gs_3.score(X_train,y_train)

0.7060905651165629

In [25]:
gs_3.score(X_test,y_test)

0.6155254485536433

In [26]:
gs_3.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 2500,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': None}

In [35]:
# Reuse the CountVectorizer settings selected by gs_2, but switch to an L1 penalty
pipe_4 = Pipeline([
    ('cvec', CountVectorizer(max_df=0.9, max_features=5000, min_df=5, ngram_range=(1, 2))),
    ('logr', LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000))
])

In [37]:
pipe_4.fit(X_train,y_train)

In [39]:
pipe_4.score(X_train,y_train)

0.968509703405346

In [40]:
pipe_4.score(X_test,y_test)

0.913584767484438
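
An L1 penalty tends to drive many coefficients to exactly zero, which acts as a form of feature selection. The sketch below compares the number of non-zero coefficients under L1 and L2 on a hypothetical toy corpus; `count_nonzero_coefs`, `toy_docs`, and `toy_labels` are names introduced for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical corpus: the two classes share no tokens.
toy_docs = [
    "short casual reply", "thanks that helped a lot",
    "no idea honestly", "works for me",
    "overall it is important to consider several factors",
    "in summary there are several key points to consider",
    "it is important to note several considerations",
    "there are several important factors to note",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def count_nonzero_coefs(penalty):
    """Fit a CountVectorizer + LogisticRegression pipeline and count active features."""
    model = Pipeline([
        ('cvec', CountVectorizer()),
        ('logr', LogisticRegression(penalty=penalty, solver='liblinear', max_iter=1000)),
    ])
    model.fit(toy_docs, toy_labels)
    coefs = model.named_steps['logr'].coef_[0]
    return int(np.count_nonzero(coefs)), coefs.size


nonzero_l1, n_features = count_nonzero_coefs('l1')
nonzero_l2, _ = count_nonzero_coefs('l2')
print(f"L1 keeps {nonzero_l1} of {n_features} features; L2 keeps {nonzero_l2}")
```

Running the same count on `pipe_4.named_steps['logr'].coef_` would show how many of the 5,000 candidate features the fitted model actually uses.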

In [44]:
pipe_5 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logr', LogisticRegression())
])

In [52]:
pipe_params_2 = {
    'tvec__max_features': [2500, 5000],
    'tvec__stop_words': [None, 'english', nltk_stop],
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'logr__penalty': ['l1', 'l2'],
    'logr__solver': ['liblinear'],
    'logr__max_iter': [1000]
}

In [53]:
gs_5 = GridSearchCV(pipe_5,pipe_params_2,cv=5)
gs_5.fit(X_train,y_train)

In [54]:
gs_5.score(X_train,y_train)

0.9325033565238618

In [55]:
gs_5.score(X_test,y_test)

0.9036982790186745

In [56]:
gs_5.best_params_

{'logr__max_iter': 1000,
 'logr__penalty': 'l2',
 'logr__solver': 'liblinear',
 'tvec__max_features': 5000,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': None}
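
To compare the five models side by side, the train and test accuracies printed above can be gathered into one table. The values below are copied from the outputs in this notebook and rounded to four decimals; the `gap` column (train minus test) is an added, rough indicator of overfitting.

```python
import pandas as pd

# Accuracies copied from the cell outputs above, rounded to four decimals.
scores = pd.DataFrame(
    {
        'model': [
            'CountVectorizer + BernoulliNB',
            'CountVectorizer + LogisticRegression',
            'CountVectorizer + StandardScaler + KNN',
            'CountVectorizer + LogisticRegression (L1)',
            'TfidfVectorizer + LogisticRegression',
        ],
        'train': [0.8959, 0.9828, 0.7061, 0.9685, 0.9325],
        'test': [0.8898, 0.9129, 0.6155, 0.9136, 0.9037],
    },
    index=['gs', 'gs_2', 'gs_3', 'pipe_4', 'gs_5'],
)

# Train-test gap: larger values suggest more overfitting.
scores['gap'] = (scores['train'] - scores['test']).round(4)

print(scores.sort_values('test', ascending=False))
```

On these numbers, the L1 logistic regression (`pipe_4`) has the highest test accuracy at 0.9136, only marginally ahead of `gs_2` at 0.9129, while BernoulliNB shows the smallest train-test gap and KNN trails on both measures.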