#### Here we will create the model.
First we preprocess the data, then we try different models on it and check which one works best.

#### Import the required modules

In [None]:

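#install the third-party packages only if they are missing, then import them
try:
    import pandas as pd
    import nltk
except ImportError:
    !pip install pandas nltk
    import pandas as pd
    import nltk
import re
from nltk.corpus import stopwords #to get the english stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords') #corpus used for stopword removal
nltk.download('wordnet') #corpus used by the lemmatizer
le=WordNetLemmatizer()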

In [None]:
#load the training data from the CSV file
df=pd.read_csv(r"Data/train_file.csv")
df.head()

#### Let us split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(df.drop('MaterialType',axis=1),df['MaterialType'],random_state=32,test_size=0.3)

In [None]:
print("Train data x shape :",x_train.shape)
print("Train data y shape :",y_train.shape)
print("Test data x shape :",x_test.shape)
print("Test data y shape :",y_test.shape)
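Because the classes turn out to be imbalanced, one possible variation on the split above is to pass `stratify` so that both sets keep the same class proportions. The sketch below is not the notebook's method. It uses a made-up frame (the `toy` data and its counts are assumptions) to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 "Book" rows and 20 "VideoCass" rows (assumption)
toy = pd.DataFrame({
    "Title": [f"title {i}" for i in range(100)],
    "MaterialType": ["Book"] * 80 + ["VideoCass"] * 20,
})

# stratify keeps the 80/20 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("MaterialType", axis=1), toy["MaterialType"],
    test_size=0.3, random_state=32, stratify=toy["MaterialType"],
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.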

A null check shows that about 1,250 records have no Subjects value. A similar check is done on the test data.

In [None]:
x_train.isnull().sum()

#### Replace the missing Subjects values with a space and then concatenate the Title and Subjects columns.

In [None]:
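#assign the result back instead of using inplace=True on a selected column, which may not update the frame
x_train['Subjects']=x_train['Subjects'].fillna(" ")
x_test['Subjects']=x_test['Subjects'].fillna(" ")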

In [None]:
#concatenate the columns title and subjects
x_train['text']=x_train['Title']+" " + x_train['Subjects']
x_test['text']=x_test['Title']+" " + x_test['Subjects']

Now we see that there are no null values left in the Subjects column.

In [None]:
x_train.isnull().sum()

In [None]:
x_test.isnull().sum()

Since the data is not balanced, we can use class weights so that the minority classes are also predicted accurately.

In [None]:
# checking if the data is balanced
y_test.value_counts()/y_test.shape[0]
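As a rough illustration of what such weights look like, scikit-learn's `compute_class_weight` with the "balanced" setting returns `n_samples / (n_classes * class_count)` per class, which is the same heuristic that `class_weight='balanced'` applies inside an estimator. The label counts below are invented for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up label distribution standing in for y_train (assumption)
labels = np.array(["Book"] * 6 + ["Sound"] * 3 + ["VideoCass"] * 1)

classes = np.unique(labels)
# "balanced": n_samples / (n_classes * number of samples in the class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))

for name, weight in class_weights.items():
    print(f"{name}: {weight:.3f}")
```

The rarest class receives the largest weight, so its misclassifications cost more during training.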

The preprocess function prepares the data for ingestion by the model. It applies the following steps:
- Keep only alphabetic characters. Numbers and punctuation are removed.
- Convert the text to lower case.
- Remove the frequently occurring stopwords.
- Convert each word in the corpus into its lemma form.

In [None]:
def preprocess(data):
    stop_words=set(stopwords.words('english')) #build the stopword set once instead of once per word
    cleaned_data=[]
    for i in data:
        text=re.sub('[^A-Za-z]',' ',i) #replace everything except letters with a space
        text=text.lower() #convert to lower case
        text=" ".join([le.lemmatize(word) for word in text.split() if word not in stop_words]) #remove stopwords and lemmatize the remaining words
        cleaned_data.append(text)
    return cleaned_data
        

In [None]:
#use preprocess function to do preprocessing of the data
x_train['cleaned_text']=preprocess(x_train['text'])

In [None]:
x_test['cleaned_text']=preprocess(x_test['text'])

In [None]:
x_train.cleaned_text.shape
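To make the first three preprocessing steps concrete, here is a minimal sketch on one invented title. It uses a tiny hand-picked stopword set instead of NLTK's English list and skips lemmatization so that it runs without any downloaded corpora. The helper name `clean_basic` and the sample text are assumptions for illustration.

```python
import re

# Tiny hand-picked stopword set standing in for NLTK's English list (assumption)
STOP_WORDS = {"the", "of", "and", "a", "in"}

def clean_basic(text):
    """Keep letters only, lower-case the text and drop stopwords (no lemmatization)."""
    text = re.sub(r"[^A-Za-z]", " ", text).lower()
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

sample = "The History of Jazz, 1900-1950: A Guide (2nd ed.)"
print(clean_basic(sample))  # history jazz guide nd ed
```

Note the side effect of dropping digits: "2nd" is reduced to the fragment "nd", which then becomes a token of its own.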

Convert the cleaned text column into a TF-IDF matrix. The vectorizer is fitted on the train data only and then reused on the test data. After this step the data is ready to be ingested by the models.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(max_features=5000)
x_train_tfidf=tfidf.fit_transform(x_train.cleaned_text)
x_test_tfidf=tfidf.transform(x_test.cleaned_text)
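The fit/transform split can be seen on a toy corpus. In this sketch (the documents are invented), the vocabulary is learned from the training documents only, so a word that appears only in the test documents has no column and is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["history jazz guide", "jazz piano music", "history rome"]
test_docs = ["jazz opera"]  # "opera" never appears in the training documents

vec = TfidfVectorizer(max_features=5000)
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.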

We will try to predict the MaterialType using the models below and then settle on the most suitable one based on its performance.
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
- Naive Bayes
- SVM

In [None]:
#initializing the models with their default settings
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
models={'Logistic Regression':LogisticRegression(),
        'Naive Bayes':MultinomialNB(),
        'SVM':SVC(),
        'Random Forest':RandomForestClassifier(),
        'Gradient Boosting':GradientBoostingClassifier()}

We will now check how the models perform with their default hyperparameters.

In [None]:
from sklearn.metrics import accuracy_score
for nm,model in models.items():
    model.fit(x_train_tfidf,y_train)
    resp=model.predict(x_test_tfidf)
    acc=accuracy_score(y_test,resp) #argument order is (y_true, y_pred)
    print(f"{nm} - Accuracy : {acc*100:.2f}%")
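A single hold-out split gives one accuracy figure per model. As a possible complement, the sketch below estimates accuracy with stratified k-fold cross-validation. It runs on synthetic data from `make_classification`, which only stands in for the TF-IDF matrix and labels; the fold count and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for x_train_tfidf / y_train (assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=32)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=32)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(f"Accuracy per fold : {scores.round(3)}")
print(f"Mean : {scores.mean():.3f}, Std : {scores.std():.3f}")
```

The spread across folds indicates how much of the difference between two models could be down to the particular split.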
    

Now let us tune the hyperparameters of each model and see whether we get better results.

**Note** : The Logistic Regression model did not converge with the default settings, so we try to make it converge by increasing the max_iter parameter.
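One way to check whether a given max_iter value is enough is to record warnings during fitting and look for scikit-learn's `ConvergenceWarning`. The sketch below does this on synthetic data with the saga solver; the data, the max_iter values and the `results` dictionary are assumptions, and the penalty is left at its default.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the TF-IDF features (assumption)
X, y = make_classification(n_samples=200, n_features=20, random_state=32)

results = {}
for max_iter in (5, 100, 5000):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure every warning is recorded
        clf = LogisticRegression(solver="saga", C=0.9, max_iter=max_iter)
        clf.fit(X, y)
    # the fit converged if no ConvergenceWarning was raised
    converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
    results[max_iter] = (converged, int(clf.n_iter_[0]))
    print(f"max_iter={max_iter} : converged={converged}, iterations used={clf.n_iter_[0]}")
```

If the number of iterations used equals max_iter, the solver stopped because it hit the limit rather than because it met the tolerance.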

In [None]:
hyper_param={'Logistic Regression':{'max_iter':[100,500,1000],'solver':['saga'],'penalty':['l1'],'C':[0.9]},
             'Naive Bayes':{},
             'SVM':{'C': [1, 10], 'kernel': ['linear', 'rbf'],'class_weight':['balanced',None]},
             'Random Forest':{'n_estimators':[100,200,400],'class_weight':['balanced','balanced_subsample']},
             'Gradient Boosting':{'n_estimators':[100,200,400]}}   

In [None]:
from sklearn.model_selection import GridSearchCV
for nm,model in models.items():
    clf=GridSearchCV(model,hyper_param[nm],refit=True)
    clf.fit(x_train_tfidf,y_train)
    resp=clf.predict(x_test_tfidf)
    acc=accuracy_score(y_test,resp) #argument order is (y_true, y_pred)
    print(f"Model Used : {nm}\nModel Parameters : {clf.best_estimator_}\nModel Accuracy : {acc*100}%")

- Even after increasing max_iter, the convergence issue did not go away.
- The GradientBoosting Classifier also ran into an issue during model fitting. **(Feel free to let me know how to resolve this error.)**

In [None]:
#SVM and Random Forest do the best job of identifying the classes, so let us also check their precision and recall
from sklearn.metrics import precision_score,recall_score,accuracy_score,f1_score,confusion_matrix,classification_report
for nm,clf in models.items():
    if nm in ('SVM','Random Forest'):
        clf=GridSearchCV(clf,hyper_param[nm],refit=True)
        clf.fit(x_train_tfidf,y_train)
        ypred=clf.predict(x_test_tfidf)
        print(f"\n\nModel {nm} \nClassification Report : \n",classification_report(y_test,ypred))
#print("Recall Score : ",recall_score(y_test,ypred,average='macro'))
#print("Precision Score : ",precision_score(y_test,ypred,average='macro'))
#print("Confusion Matrix",confusion_matrix(y_test,ypred))
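Accuracy alone can hide poor performance on rare classes, which is why the per-class report matters here. A small made-up example (the labels are invented) shows a classifier that never predicts the minority class and still scores 80% accuracy, while macro-averaged recall drops to 0.5.

```python
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Made-up labels: the minority class "VideoCass" is never predicted (assumption)
y_true = ["Book"] * 8 + ["VideoCass"] * 2
y_pred = ["Book"] * 10

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred, labels=["Book", "VideoCass"])

print(f"Accuracy : {acc:.2f}")               # 0.80
print(f"Macro recall : {macro_recall:.2f}")  # 0.50
print(cm)  # rows are true classes, columns are predicted classes
```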

###### From here we see that both models have similar accuracy. Random Forest seems to work slightly better.
###### Future Improvement:
- **VideoCass** and **VideoDisk** seem to have lower recall than the major classes such as Book, Sound and Video.
- This can be improved further by doing the following:
    - Use a data sampling technique to increase the number of data points for classes where little data is present.
    - Merge these categories into larger groups: **Book**, **Sound** and **Video**. (Miscellaneous may be removed or ignored, since it accounts for less than 1% of the records.)
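
Both ideas can be sketched in a few lines. The example below is only an illustration on an invented frame: the counts, the `group_map` mapping and the `MaterialGroup` column are assumptions, and random oversampling via `sklearn.utils.resample` is just one of several possible sampling techniques.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training frame; class names follow the notebook, counts are invented
train = pd.DataFrame({
    "cleaned_text": [f"doc {i}" for i in range(10)],
    "MaterialType": ["Book"] * 7 + ["VideoCass"] * 2 + ["VideoDisk"] * 1,
})

# Option 1: random oversampling of every class up to the majority-class size
target = train["MaterialType"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=target, random_state=32)
    for _, group in train.groupby("MaterialType")
])

# Option 2: merge fine-grained classes into broader groups (hypothetical mapping)
group_map = {"VideoCass": "Video", "VideoDisk": "Video"}
train["MaterialGroup"] = train["MaterialType"].replace(group_map)

print(balanced["MaterialType"].value_counts().to_dict())
print(train["MaterialGroup"].value_counts().to_dict())
```

Oversampling should be applied to the training split only, so that the test set keeps reflecting the real class distribution.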