# <center>Feedback Prize - Predicting Effective Arguments</center>

<img src='https://www.incimages.com/uploaded_files/image/1920x1080/getty_506903004_200013332000928076_348061.jpg'>



# <center>**NLP Problem: Text Classification**</center>

## <center>**Goal of the Competition:**</center> 

The goal of this competition is to classify argumentative elements in student writing as "effective," "adequate," or "ineffective." You will create a model trained on data that is representative of the 6th-12th grade population in the United States in order to minimize bias. Models derived from this competition will help pave the way for students to receive enhanced feedback on their argumentative writing. With automated guidance, students can complete more assignments and ultimately become more confident, proficient writers.

## ***Continuing the approach from the U.S. Patent Phrase to Phrase Matching competition:***

1. https://www.kaggle.com/code/venkatkumar001/u-s-p-p-baseline-eda-dataprep
2. Starter1: https://www.kaggle.com/code/venkatkumar001/nlp-starter1-almost-all-basic-concept
3. Starter2: https://www.kaggle.com/code/venkatkumar001/nlp-starter2-hf-pretrain-finetune
4. https://www.kaggle.com/code/venkatkumar001/transformeranatomy-encoder

## ***Now, on to the Feedback Prize competition!***

1. Starter3: https://www.kaggle.com/venkatkumar001/nlpstarter3-baseline-approach


## ***In this notebook, I build a baseline approach on the competition data***

# **Steps:**

### **1. Import the necessary libraries**

### **2. Load and analyze the data**

### **3. Preprocessing**

### **4. Feature selection**

### **5. Build the Model**

### **6. Predict Output**

### **7. Generate Submission file**



# <center>**Import Necessary Libraries**</center>

In [None]:
import pandas as pd
import numpy as np
import re
import os
import matplotlib.pyplot as plt
import seaborn as sns
import time
import datetime
from scipy import sparse


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.metrics import log_loss

from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB,MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score,recall_score

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from string import punctuation
from nltk.stem.wordnet import WordNetLemmatizer
from tqdm import tqdm

%matplotlib inline


# <center>**Load and Analyze the Data**</center>

In [None]:
!ls '../input/feedback-prize-effectiveness'

In [None]:
train = pd.read_csv('../input/feedback-prize-effectiveness/train.csv')
test = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')
sample = pd.read_csv('../input/feedback-prize-effectiveness/sample_submission.csv')
print(f'Train_Shape: {train.shape},Test_Shape: {test.shape},Sample_Shape: {sample.shape}')
display(train.sample(2))
display(test.sample(2))
display(sample.sample(2))

In [None]:
train.info()

In [None]:
train.describe(include='object')

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='discourse_id',data=train)

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='discourse_type', data=train)  # pass the column via x=; recent seaborn treats the first positional argument as data

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='discourse_effectiveness', data=train)  # pass the column via x=; recent seaborn treats the first positional argument as data

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='essay_id', data=train)  # one bar per essay, so this plot is slow and crowded

In [None]:
train.discourse_type.unique()

## **Identify the imbalance of data - discourse_type**

In [None]:
# Count how many rows belong to each discourse_type to reveal class imbalance

def count_target(target_list):
    target_dict = {}
    for x in target_list:
        # Sum the boolean mask; len() of the mask would always return the total row count
        target_dict[x] = int((train['discourse_type'] == x).sum())
    return target_dict

target_list = ['Lead', 'Position', 'Claim', 'Evidence', 'Counterclaim',
       'Rebuttal', 'Concluding Statement']
count_target(target_list)
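
As a cross-check on the counts above, pandas can tally the categories directly with `value_counts`. This is only a minimal sketch: the frame `toy` is hypothetical stand-in data, not the competition file.

```python
import pandas as pd

# Hypothetical toy data standing in for train['discourse_type'].
toy = pd.DataFrame({
    "discourse_type": ["Claim", "Claim", "Evidence", "Lead", "Claim", "Evidence"]
})

# Absolute number of rows per category, most frequent first.
counts = toy["discourse_type"].value_counts()

# Relative share of each category; the shares sum to 1.0.
shares = toy["discourse_type"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

On the real data, the same two calls on `train['discourse_type']` would show both the raw counts and how skewed the distribution is.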


# <center>**Preprocessing - NLTK**</center>

## **1. Clean the discourse_text column and store the result in a new column**

In [None]:
def cleanup_text(text):
    words = re.sub(pattern = '[^a-zA-Z]',repl = ' ', string = text)
    words = words.lower()
    return words

cleanup_text('VK, is beast mode in my NLP competition ')
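
Note that `cleanup_text` replaces every non-letter with a space, so punctuation followed by a space leaves double spaces, and trailing spaces survive. A possible variant that collapses those runs is sketched below; the name `cleanup_text_compact` is hypothetical and not part of the original notebook.

```python
import re

def cleanup_text_compact(text):
    # Replace each run of non-letter characters with a single space.
    words = re.sub(r"[^a-zA-Z]+", " ", text)
    # Lowercase and drop leading/trailing spaces.
    return words.lower().strip()

print(repr(cleanup_text_compact("VK, is beast mode in my NLP competition ")))
```

The default TF-IDF tokenizer ignores extra whitespace anyway, so this mainly makes the stored `text_preprocessed` column easier to read.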

In [None]:
text_preprocessed = train['discourse_text'].apply(cleanup_text)
text_preprocessed
train['text_preprocessed'] = text_preprocessed
train.sample(2)

In [None]:
test_preprocessed = test['discourse_text'].apply(cleanup_text)
test_preprocessed
test['text_preprocessed'] =  test_preprocessed
test.head()

## **2. Map the target data**

In [None]:
effectiveness_map = {'Ineffective' : 0, 'Adequate':1,'Effective':2}
train['target'] = train['discourse_effectiveness'].map(effectiveness_map)
train.sample(2)
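
One thing worth checking after the mapping: `Series.map` with a dict returns `NaN` for any value that is not a key, so a misspelled or differently cased label would silently become a missing target. A minimal sketch of that check, using hypothetical labels rather than the real column:

```python
import pandas as pd

effectiveness_map = {"Ineffective": 0, "Adequate": 1, "Effective": 2}

# Hypothetical labels; "adequate" (lowercase) is deliberately not in the map.
labels = pd.Series(["Effective", "Adequate", "adequate", "Ineffective"])
mapped = labels.map(effectiveness_map)

# Collect the original values that failed to map.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list on `train['discourse_effectiveness']` would confirm that every row received a target.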

## **3. Preprocessing - One-hot encode discourse_type and TF-IDF vectorize both the raw discourse_text and the cleaned text**

In [None]:
tf = TfidfVectorizer(ngram_range=(1,2),norm='l2', smooth_idf=True)
train_discourse_tfidf = tf.fit_transform(train["discourse_text"])
test_discourse_tfidf = tf.transform(test["discourse_text"])
    

In [None]:
tf = TfidfVectorizer(ngram_range=(1,2),norm='l2', smooth_idf=True) # Create a fresh vectorizer so it learns a separate vocabulary for the cleaned text
train_text_tfidf = tf.fit_transform(train["text_preprocessed"])
test_text_tfidf = tf.transform(test["text_preprocessed"])
    

In [None]:
#discourse_type
ohe = OneHotEncoder()
train_type_ohe =  sparse.csr_matrix(ohe.fit_transform(train["discourse_type"].values.reshape(-1,1)))
test_type_ohe =  sparse.csr_matrix(ohe.transform(test["discourse_type"].values.reshape(-1,1)))
        

## **4. Stack all three preprocessed feature matrices**

In [None]:
# Stack the three sparse feature matrices column-wise
train_tfidf = sparse.hstack((train_type_ohe,train_discourse_tfidf,train_text_tfidf))
test_tfidf = sparse.hstack((test_type_ohe,test_discourse_tfidf,test_text_tfidf))
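
A small sketch of how `sparse.hstack` combines blocks: the row counts must match and the column counts add up. The matrices `a` and `b` are hypothetical stand-ins for the one-hot and TF-IDF blocks. By default `hstack` returns a COO matrix; passing `format="csr"` is an optional choice made here so the result supports row indexing.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot block: 3 rows, 2 columns.
a = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))
# Hypothetical TF-IDF block: 3 rows, 3 columns.
b = sparse.csr_matrix(np.array([[0.5, 0.0, 0.0],
                                [0.0, 0.2, 0.0],
                                [0.0, 0.0, 0.9]]))

# Column-wise stacking keeps the 3 rows and yields 2 + 3 = 5 columns.
stacked = sparse.hstack((a, b), format="csr")
print(stacked.shape)
print(stacked.toarray())
```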
    

In [None]:
train_tfidf

# <center>**Build the Model**</center>

## **1. Logistic Regression**

## **2. BernoulliNB**

## **3. SVM**

## **4. Boosting Methods**

## **1. LogisticRegression - First try** 

<img src='https://www.statisticalaid.com/wp-content/uploads/2021/05/tempsnip2.png'>

In [None]:
# clf1 = LogisticRegression(max_iter=500,penalty="l2",C=1.0131816333513533)
# clf1.fit(train_tfidf, train["target"].values)

## **2. BernoulliNB - Second try**

<img src='https://i.stack.imgur.com/e3KGO.png'>

In [None]:
# #Model
# clf2 = BernoulliNB()
# clf2.fit(train_tfidf, train["target"].values)

## **3. SVM - Third try**

<img src='https://www.researchgate.net/publication/304611323/figure/fig8/AS:668377215406089@1536364954428/Classification-of-data-by-support-vector-machine-SVM.png'>

In [None]:
# #Model
# clf3 = svm.SVC(decision_function_shape='ovo')
# clf3.fit(train_tfidf, train["target"].values)

## **4. Boosting methods - These gave good results**

<img src='https://cdn.educba.com/academy/wp-content/uploads/2019/11/bagging-and-boosting.png'>

# **CatBoost - Takes a long time to run**

In [None]:
# from catboost import CatBoostRegressor,CatBoostClassifier

# #cat
# catpara={
#         'learning_rate': 0.001152,
#         "max_depth": 3,
#         'random_state':42,
#         'n_estimators':1000
#     }
# cat = CatBoostClassifier(**catpara).fit(train_tfidf, train["target"].values,verbose=False)

## **XGBoost - Gives a good result; let's try for a better one**

In [None]:
# from xgboost import XGBRegressor,XGBClassifier

# # Hyperparameters for XGBClassifier
# xgb_params = {
#         'learning_rate': 0.03628302216953097,
#         'subsample': 0.7875490025178,
#         'colsample_bytree': 0.11807135201147,
#         'max_depth': 3,
#         'booster': 'gbtree', 
#         'reg_lambda': 0.0008746338866473539,
#         'reg_alpha': 23.13181079976304,
#         'random_state':40,
#         'n_estimators':5000
        
        
#     }
    
# model= XGBClassifier(**xgb_params,
#                        tree_method='gpu_hist',
#                        predictor='gpu_predictor',
#                        gpu_id=0)
    
# model.fit(train_tfidf, train["target"].values)

## **LightGBM**

In [None]:
# Hyperparameters for the LightGBM classifier
from lightgbm import LGBMRegressor,LGBMClassifier
import lightgbm as lgb

params_lgb = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    'subsample': 0.95312,
    'learning_rate': 0.001635,
    "max_depth": 3,
    'random_state':12,
    'n_estimators':15000,
    }
    


model1= LGBMClassifier(**params_lgb )
    
model1.fit(train_tfidf, train["target"].values)
    


# **Predict output - using Boosting Model**

In [None]:
test_predict = model1.predict_proba(test_tfidf)
test_predict
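
The notebook imports `log_loss`, and the competition scores submissions with multi-class logarithmic loss, so it helps to see how that number is computed from probabilities like `test_predict`. Below is a minimal plain-Python sketch; the helper `multiclass_log_loss` and the toy values are hypothetical, and clipping with `eps` is one common convention rather than the official scorer.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that log(0) never occurs.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical true classes (0=Ineffective, 1=Adequate, 2=Effective)
# and predicted probabilities in the same column order.
y_true = [0, 1, 2]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5]]

loss = multiclass_log_loss(y_true, probs)
uniform_loss = multiclass_log_loss(y_true, [[1 / 3] * 3] * 3)
print(round(loss, 4), round(uniform_loss, 4))
```

Predicting 1/3 for every class gives a loss of ln(3), which is a useful reference point: a model scoring above that is doing worse than a uniform guess.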

In [None]:
sample.loc[:,"Ineffective"] = test_predict[:,0]
sample.loc[:,"Adequate"] = test_predict[:,1]
sample.loc[:,"Effective"] = test_predict[:,2]
sample.to_csv('submission.csv',index=False)
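
Before submitting, a quick sanity check on the probability columns can catch shape or ordering mistakes. This is a sketch on hypothetical toy values (`toy_pred`, `toy_sub`), not the real predictions; the column names follow the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for two rows, in class order 0, 1, 2.
toy_pred = np.array([[0.2, 0.5, 0.3],
                     [0.1, 0.6, 0.3]])
toy_sub = pd.DataFrame({"discourse_id": ["id_a", "id_b"]})

prob_cols = ["Ineffective", "Adequate", "Effective"]
for col_idx, name in enumerate(prob_cols):
    toy_sub[name] = toy_pred[:, col_idx]

# Each row of class probabilities should sum to 1, with no missing values.
rows_ok = bool(np.allclose(toy_sub[prob_cols].sum(axis=1), 1.0))
no_missing = not bool(toy_sub[prob_cols].isna().any().any())
print(rows_ok, no_missing)
```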

In [None]:
print('Successfully executed all cells!')
sample.head()

## ***Next step, coming soon: apply K-fold cross-validation***
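
One way the K-fold step might look, as a sketch under stated assumptions: it uses `StratifiedKFold` (a swap for the plain `KFold` imported earlier, so each fold keeps all three classes), a `LogisticRegression` in place of the boosting model to keep it fast, and a hypothetical toy corpus. The TF-IDF vectorizer is fit inside each fold so the validation text does not leak into the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy corpus: 4 texts per class (0, 1, 2).
texts = np.array([
    "weak point no support", "unclear claim no reason",
    "confusing and off topic", "no evidence given here",
    "some support for claim", "reasonable point made here",
    "claim with basic reason", "acceptable but plain evidence",
    "strong claim clear evidence", "convincing detailed reasoning here",
    "well supported persuasive point", "clear strong persuasive evidence",
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
oof = np.zeros((len(texts), 3))  # out-of-fold probabilities

for train_idx, valid_idx in skf.split(texts, labels):
    # Fit the vectorizer on the training fold only.
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X_tr = vec.fit_transform(texts[train_idx])
    X_va = vec.transform(texts[valid_idx])

    clf = LogisticRegression(max_iter=500)
    clf.fit(X_tr, labels[train_idx])
    oof[valid_idx] = clf.predict_proba(X_va)

# Score all out-of-fold predictions with the multi-class log loss.
score = log_loss(labels, oof, labels=[0, 1, 2])
print(f"OOF log loss: {score:.4f}")
```

The out-of-fold score gives an estimate of leaderboard performance without touching the test set, and the per-fold models can later be averaged for the final prediction.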

## **"If you spot any errors or have suggestions, feel free to share them with me"**

Credits for this notebook:

1. https://www.kaggle.com/code/chandraprajapati/feedback-prize-logistic-regression
2. https://www.kaggle.com/code/venkatkumar001/nlp-starter1-almost-all-basic-concept
3. https://www.kaggle.com/code/bhavikardeshna/logisticregression-feedback-price-effectiveness


## <center>⭐️⭐️Thanks for visiting guys⭐️⭐️</center>