# Problem Statement

The goal is to train a model on Stack Exchange questions from six topics (biology, cooking, robotics, travel, diy and crypto) and use that model to predict the tags of questions about physics.

# Challenge

The challenge of this competition is that it violates a basic assumption of machine learning: that the training and test data come from the same distribution. Here the training data and the test data come from completely different domains.

Since one question can carry several tags at the same time, this is a multilabel classification problem.
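To make the multilabel framing concrete, the sketch below turns space-separated tag strings into binary indicator rows, one column per tag. The sample tag strings and the helper name `tags_to_indicator` are illustrative assumptions, not part of the competition data.

```python
def tags_to_indicator(tag_strings):
    """Map space-separated tag strings to (sorted vocabulary, 0/1 rows)."""
    vocabulary = sorted({tag for tags in tag_strings for tag in tags.split()})
    column = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tags in tag_strings:
        row = [0] * len(vocabulary)
        for tag in tags.split():
            row[column[tag]] = 1
        rows.append(row)
    return vocabulary, rows


# Hypothetical examples: each question may carry several tags at once.
sample_tags = ["quantum-mechanics energy", "energy", "optics quantum-mechanics"]
vocabulary, indicator = tags_to_indicator(sample_tags)
print(vocabulary)
print(indicator)
```

Each row can contain any number of ones, which is what separates this setting from ordinary multiclass classification, where exactly one column would be set.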

# Solution Approach

At a high level, the solution approach looks like this:

1. Data Ingestion
2. Data Cleaning
3. Data Preprocessing
4. Exploratory Data Analysis (EDA)
5. Feature Extraction
6. Model Training
7. Hyperparameter Tuning
8. Model Evaluation


# Import the necessary libraries

In [None]:
import os
import numpy as np
import pandas as pd
from collections import Counter
from sklearn import preprocessing
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from scipy import sparse
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score,f1_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import joblib
import ast
import seaborn as sns
%matplotlib inline

In [None]:
os.chdir('../input/transfer-learning-on-stack-exchange-tags')

In [None]:
os.listdir()

# Data Ingestion

All six data files (biology, cooking, crypto, diy, robotics and travel) are loaded into pandas dataframes, which are then concatenated into a single dataframe for analysis and model training.

In [None]:
biology_data=pd.read_csv('biology.csv.zip')
cooking_data=pd.read_csv('cooking.csv.zip')
crypto_data=pd.read_csv('crypto.csv.zip')
diy_data=pd.read_csv('diy.csv.zip')
robotics_data=pd.read_csv('robotics.csv.zip')
travel_data=pd.read_csv('travel.csv.zip')
test_data=pd.read_csv('test.csv.zip')

In [None]:
test_data.head()

In [None]:
combined_data=pd.DataFrame()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combined_data=pd.concat([biology_data,cooking_data,crypto_data,diy_data,robotics_data,travel_data])

In [None]:
combined_data.shape

In [None]:
combined_data.columns

In [None]:
combined_data.head()

In [None]:
combined_data.shape

# Data Cleaning

The text contains HTML tags and leading and trailing whitespace, so we will write some helper functions to clean the dataframe.
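As a quick check of this cleaning step, the sketch below strips tags with a non-greedy regular expression and trims whitespace. It also decodes HTML entities with `html.unescape`, which is an added assumption: the helpers in this notebook do not decode entities. The sample string and the name `clean_text` are hypothetical.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time


def clean_text(raw):
    """Remove HTML tags, decode entities, and trim surrounding whitespace."""
    without_tags = TAG_PATTERN.sub("", raw)
    return html.unescape(without_tags).strip()


sample = "  <p>Salt &amp; pepper <b>ratio</b>?</p>\n"
print(clean_text(sample))
```

Because `.` does not match a newline by default, a tag that spans several lines would survive this pattern; that limitation applies to any regex of this shape.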

In [None]:
def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)  

In [None]:
def remove_space(text):
    """Remove leading and trailing whitespace from the text"""
    return text.strip()

In [None]:
combined_data['title']=combined_data['title'].apply(lambda x: remove_html_tags(x))
combined_data['content']=combined_data['content'].apply(lambda x: remove_html_tags(x))

In [None]:
combined_data['title']=combined_data['title'].apply(lambda x: remove_space(x))
combined_data['content']=combined_data['content'].apply(lambda x: remove_space(x))


In [None]:
combined_data=combined_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles


In [None]:
combined_data.shape

In [None]:
combined_data.head()

In [None]:
combined_data.reset_index(drop=True,inplace=True)

# Data Preprocessing

Before further analysis and model training, the text needs some preprocessing. The following steps are applied to the data:

1. Tokenization
2. Stopword removal
3. Punctuation removal
4. Lemmatization (implemented below with NLTK's Snowball stemmer, so strictly speaking this is stemming)
5. Lowercasing all the words in the text
6. Combining the text of the title and content columns into a single field for training
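The steps listed above can be sketched end to end on a single string. To stay dependency-free, this sketch swaps in a regular-expression tokenizer, a tiny hand-written stop-word set, and a naive suffix stripper in place of the NLTK tokenizer, stop-word list, and Snowball stemmer used in the notebook; all three substitutes and the function names are assumptions for illustration only.

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "in", "how", "to", "and"}  # toy list


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, content):
    """Tokenize, lowercase, drop stop words, stem, and join title + content."""
    tokens = re.findall(r"[A-Za-z]+", title + " " + content)  # letters only
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOP_WORDS]
    return " ".join(naive_stem(token) for token in kept)


print(preprocess("How to measure the speed of light?", "Using mirrors and lasers."))
```

Lowercasing happens before the stop-word filter here, so capitalised stop words such as "How" are removed as well.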


In [None]:
def lemmatization(tokens):
    """Reduce tokens to their stems (Snowball stemming, not true lemmatization)"""
    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer("english")
    stemmed=[stemmer.stem(x) for x in tokens]
    return stemmed
    

In [None]:
def tokenize(text):
    from nltk.tokenize import word_tokenize
    return word_tokenize(text)

In [None]:
def remove_punctuation(tokens):
    words = [word for word in tokens if word.isalpha()]
    return words

In [None]:
def remove_stopwords(tokens):
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    extra_words = ['a', "a's", 'able', 'about', 'above', 'according', 'accordingly',
              'across', 'actually', 'after', 'afterwards', 'again', 'against',
              "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along',
              'already', 'also', 'although', 'always', 'am', 'among', 'amongst',
              'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone',
              'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear',
              'appreciate', 'appropriate', 'are', "aren't", 'around', 'as',
              'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away',
              'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes',
              'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
              'believe', 'below', 'beside', 'besides', 'best', 'better',
              'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon",
              "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause',
              'causes', 'certain', 'certainly', 'changes', 'clearly', 'co',
              'com', 'come', 'comes', 'concerning', 'consequently', 'consider',
              'considering', 'contain', 'containing', 'contains',
              'corresponding', 'could', "couldn't", 'course', 'currently', 'd',
              'definitely', 'described', 'despite', 'did', "didn't",
              'different', 'do', 'does', "doesn't", 'doing', "don't", 'done',
              'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight',
              'either', 'else', 'elsewhere', 'enough', 'entirely', 'especially',
              'et', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone',
              'everything', 'everywhere', 'ex', 'exactly', 'example', 'except',
              'f', 'far', 'few', 'fifth', 'first', 'five', 'followed',
              'following', 'follows', 'for', 'former', 'formerly', 'forth',
              'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets',
              'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got',
              'gotten', 'greetings', 'h', 'had', "hadn't", 'happens', 'hardly',
              'has', "hasn't", 'have', "haven't", 'having', 'he', "he's",
              'hello', 'help', 'hence', 'her', 'here', "here's", 'hereafter',
              'hereby', 'herein', 'hereupon', 'hers', 'herself', 'hi', 'him',
              'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit',
              'however', 'i', "i'd", "i'll", "i'm", "i've", 'ie', 'if',
              'ignored', 'immediate', 'in', 'inasmuch', 'inc', 'indeed',
              'indicate', 'indicated', 'indicates', 'inner', 'insofar',
              'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll",
              "it's", 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps',
              'kept', 'know', 'knows', 'known', 'l', 'last', 'lately', 'later',
              'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's",
              'like', 'liked', 'likely', 'little', 'look', 'looking', 'looks',
              'ltd', 'm', 'mainly', 'many', 'may', 'maybe', 'me', 'mean',
              'meanwhile', 'merely', 'might', 'more', 'moreover', 'most',
              'mostly', 'much', 'must', 'my', 'myself', 'n', 'name', 'namely',
              'nd', 'near', 'nearly', 'necessary', 'need', 'needs', 'neither',
              'never', 'nevertheless', 'new', 'next', 'nine', 'no', 'nobody',
              'non', 'none', 'noone', 'nor', 'normally', 'not', 'nothing',
              'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often',
              'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', 'only',
              'onto', 'or', 'other', 'others', 'otherwise', 'ought', 'our',
              'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own',
              'p', 'particular', 'particularly', 'per', 'perhaps', 'placed',
              'please', 'plus', 'possible', 'presumably', 'probably',
              'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're',
              'really', 'reasonably', 'regarding', 'regardless', 'regards',
              'relatively', 'respectively', 'right', 's', 'said', 'same', 'saw',
              'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing',
              'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves',
              'sensible', 'sent', 'serious', 'seriously', 'seven', 'several',
              'shall', 'she', 'should', "shouldn't", 'since', 'six', 'so',
              'some', 'somebody', 'somehow', 'someone', 'something', 'sometime',
              'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry',
              'specified', 'specify', 'specifying', 'still', 'sub', 'such',
              'sup', 'sure', 't', "t's", 'take', 'taken', 'tell', 'tends', 'th',
              'than', 'thank', 'thanks', 'thanx', 'that', "that's", 'thats',
              'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence',
              'there', "there's", 'thereafter', 'thereby', 'therefore',
              'therein', 'theres', 'thereupon', 'these', 'they', "they'd",
              "they'll", "they're", "they've", 'think', 'third', 'this',
              'thorough', 'thoroughly', 'those', 'though', 'three', 'through',
              'throughout', 'thru', 'thus', 'to', 'together', 'too', 'took',
              'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying',
              'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless',
              'unlikely', 'until', 'unto', 'up', 'upon', 'us', 'use', 'used',
              'useful', 'uses', 'using', 'usually', 'uucp', 'v', 'value',
              'various', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants',
              'was', "wasn't", 'way', 'we', "we'd", "we'll", "we're", "we've",
              'welcome', 'well', 'went', 'were', "weren't", 'what', "what's",
              'whatever', 'when', 'whence', 'whenever', 'where', "where's",
              'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
              'wherever', 'whether', 'which', 'while', 'whither', 'who',
              "who's", 'whoever', 'whole', 'whom', 'whose', 'why', 'will',
              'willing', 'wish', 'with', 'within', 'without', "won't", 'wonder',
              'would', 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you',
              "you'd", "you'll", "you're", "you've", 'your', 'yours',
              'yourself', 'yourselves', 'z', 'zero', '','is','based','aa','aaa','aac','aad','aav','ab','aa','aa aa',
 'aa ab',
 'aa batteri',
 'aa lt',
 'aaa',
 'aabb',
 'aabb aabb',
 'aabbcc',
 'aac',
 'aad',
 'aasa',
 'aav',
 'ab']
    
    # Use a set so that each membership test is O(1) instead of scanning a list
    new_stop=set(stop_words + extra_words)
    filtered_words=[word for word in tokens if word not in new_stop]
    return filtered_words

In [None]:
def lower_word(tokens):
    words = [word.lower() for word in tokens]
    return words


In [None]:
combined_data['title_words']=combined_data['title'].apply(lambda x: tokenize(x))

In [None]:
combined_data['content_words']=combined_data['content'].apply(lambda x: tokenize(x))

In [None]:
combined_data['title_words']=combined_data['title_words'].apply(lambda x: remove_punctuation(x))
combined_data['content_words']=combined_data['content_words'].apply(lambda x: remove_punctuation(x))


In [None]:
combined_data['title_words']=combined_data['title_words'].apply(lambda x: lower_word(x))
combined_data['content_words']=combined_data['content_words'].apply(lambda x: lower_word(x))

In [None]:
combined_data['title_words']=combined_data['title_words'].apply(lambda x: remove_stopwords(x))
combined_data['content_words']=combined_data['content_words'].apply(lambda x: remove_stopwords(x))

In [None]:
combined_data.reset_index(drop=True,inplace=True)

In [None]:
combined_data.head()

In [None]:
combined_data['text']=combined_data['title_words']+ combined_data['content_words']

In [None]:
combined_data.head()

In [None]:
#combined_data.loc[0,'title_words']

In [None]:
combined_data['text']=combined_data['text'].apply(lambda x: lemmatization(x))

In [None]:
combined_data.text.head()

In [None]:
combined_data['text']=combined_data['text'].apply(lambda x: ' '.join(x))

In [None]:
####################### Save Combined Data ############################################################
#combined_data.to_csv('combined_data_preprocessed.csv',index=False)

# EDA (Exploratory Data Analysis)

Before training the model it is important to understand and explore the data. The following analyses will be done:

1. Number of datapoints in each category
2. Distribution of tags
3. Number of unique tags
4. Share of questions covered by the top n tags

## Datapoints corresponding to each category 

In [None]:
biology_data=biology_data.drop_duplicates(subset=['title'],) #### Removing rows with duplicate titles
travel_data=travel_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles
cooking_data=cooking_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles
robotics_data=robotics_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles
diy_data=diy_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles
crypto_data=crypto_data.drop_duplicates(subset=['title'],)      #### Removing rows with duplicate titles


In [None]:
datapoints=[]
datapoints.extend((biology_data.shape[0],travel_data.shape[0],cooking_data.shape[0],robotics_data.shape[0],diy_data.shape[0],
                  crypto_data.shape[0]))
topics=['biology','travel','cooking','robotics','diy','crypto']
topic_count=pd.DataFrame({'topics':topics,'datapoints':datapoints})
topic_count['percentage']=(topic_count['datapoints']/topic_count['datapoints'].sum())*100

In [None]:
topic_count.head()

In [None]:
sns.barplot(x=topic_count['topics'],y=topic_count['datapoints'])

In [None]:
sns.barplot(x=topic_count['topics'],y=topic_count['percentage'])

The following observations can be drawn from the above plots:

1. diy has the highest number of datapoints (30%)
2. robotics has the lowest number of datapoints (3.2%)

This makes the problem harder: the language and vocabulary of robotics are probably the closest to physics, but robotics contributes very few training datapoints.


## Distribution of tags

In [None]:
combined_data.head()

In [None]:
tags_count = combined_data["tags"].apply(lambda x: len(x.split(" "))) # counting the number of tags for each datapoint


In [None]:
combined_data['Tags_Count'] = tags_count


In [None]:
combined_data.head()

In [None]:
print("Maximum number of tags per question = "+str(max(combined_data['Tags_Count'])))
print("Minimum number of tags per question = "+str(min(combined_data['Tags_Count'])))
print("Avg number of tags per question = "+str(sum(combined_data['Tags_Count'])/len(combined_data['Tags_Count'])))

In [None]:
# Name the columns explicitly so the result does not depend on the pandas version
questions_per_tag=combined_data['Tags_Count'].value_counts().rename_axis('tag_count').reset_index(name='question_count')

In [None]:
questions_per_tag=questions_per_tag.rename(columns={'index':'tag_count','Tags_Count':'question_count'})

In [None]:
questions_per_tag['percentage']=(questions_per_tag['question_count']/questions_per_tag['question_count'].sum())*100

In [None]:
questions_per_tag.head()

In [None]:
sns.barplot(x=questions_per_tag['tag_count'],y=questions_per_tag['question_count'])

In [None]:
sns.barplot(x=questions_per_tag['tag_count'],y=questions_per_tag['percentage'])

From the above plot the following observations can be drawn:

1. Only 8% of questions have 5 tags
2. The largest share of questions (30%) has 2 tags
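The percentages behind such a plot come from a simple frequency count. A minimal sketch with `collections.Counter`, using made-up tag strings rather than the notebook's data:

```python
from collections import Counter


def tag_count_distribution(tag_strings):
    """Return {number_of_tags: percentage_of_questions}."""
    counts = Counter(len(tags.split()) for tags in tag_strings)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in sorted(counts.items())}


sample_tags = ["visas usa", "baking", "motor control pid", "dna genetics"]
print(tag_count_distribution(sample_tags))
```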


## Tag Overlap 

In [None]:
combined_data['text_original']=combined_data['title_words'] + combined_data['content_words']

In [None]:
combined_data['tags_words']=combined_data['tags'].apply(lambda x: x.split(' '))

In [None]:
combined_data.drop(['text_original'],inplace=True,axis=1)

In [None]:
combined_data.head()

In [None]:
for i in range(0,len(combined_data)):
    #print(i)
    tag_words=combined_data.loc[i,'tags_words']
    title_words=combined_data.loc[i,'title_words']
    common_title_words=set(tag_words)&set(title_words)
    combined_data.loc[i,'title_overlap']=len(common_title_words)

In [None]:
for i in range(0,len(combined_data)):
    #print(i)
    tag_words=combined_data.loc[i,'tags_words']
    content_words=combined_data.loc[i,'content_words']
    common_content_words=set(tag_words)&set(content_words)
    combined_data.loc[i,'content_overlap']=len(common_content_words)

In [None]:
combined_data.head()

In [None]:
combined_data['title_overlap_percent']=(combined_data['title_overlap']/combined_data['Tags_Count'])*100

In [None]:
combined_data['content_overlap_percent']=(combined_data['content_overlap']/combined_data['Tags_Count'])*100

In [None]:
combined_data.head()

In [None]:
print("Average Title Overlap = {}".format(combined_data.title_overlap_percent.mean()))
print("Average Content Overlap = {}".format(combined_data.content_overlap_percent.mean()))
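The row-by-row loops above can also be expressed as one small function applied per question. The sketch below computes the share of a question's distinct tags that appear verbatim among its tokens; the function name and the sample rows are illustrative assumptions.

```python
def overlap_percent(tags, tokens):
    """Percentage of space-separated tags that occur in the token list."""
    tag_set = set(tags.split())
    if not tag_set:
        return 0.0
    return 100.0 * len(tag_set & set(tokens)) / len(tag_set)


title_tokens = ["bake", "bread", "without", "yeast"]
content_tokens = ["sourdough", "starter", "replaces", "yeast", "baking"]
print(overlap_percent("bread yeast baking", title_tokens))    # 2 of 3 tags
print(overlap_percent("bread yeast baking", content_tokens))  # 2 of 3 tags
```

Hyphenated tags such as `air-travel` can never match under this definition, because the punctuation filter keeps only purely alphabetic tokens, so the measured overlap understates how often a tag is mentioned in the text.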


## Number of Unique Tags 

In [None]:
vectorizer = CountVectorizer(tokenizer = lambda x: x.split(" "))
tagcount = vectorizer.fit_transform(combined_data['tags'])

In [None]:
print("Total number of datapoints = {}".format(tagcount.shape[0]))
print("Total number of unique tags = {}".format(tagcount.shape[1]))

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[0:10])

## Most frequent tags 

In [None]:
#top 10 highest occurring tags
col_sum = tagcount.sum(axis = 0).A1 
feat_count = dict(zip(vectorizer.get_feature_names_out(), col_sum))  # get_feature_names() was removed in scikit-learn 1.2
feat_count_sorted = dict(sorted(feat_count.items(), key = lambda x: x[1], reverse = True))
count_data = {"Tags":list(feat_count_sorted.keys()), "Count": list(feat_count_sorted.values())}
count_df = pd.DataFrame(data = count_data)
count_df[:10]

In [None]:
count_df['Percentage']=(count_df['Count']/count_df['Count'].sum())*100

In [None]:
count_df=count_df[:10]

In [None]:
count_df

In [None]:
plt.figure(figsize = (12, 4))
sns.barplot(x=count_df['Tags'],y=count_df['Count'])

In [None]:
plt.figure(figsize = (12, 4))
sns.barplot(x=count_df['Tags'],y=count_df['Percentage'])

The above graph shows the share of all tag occurrences accounted for by each of the top 10 tags:

1. The most frequent tag, electrical, occurs only 4489 times (about 2% of all tag occurrences)
2. Tags such as visas, air-travel, usa, schengen and uk belong to travel, which is not a science-related topic

## Questions covered by Top n tags 

In [None]:
vectorizer = CountVectorizer(tokenizer = lambda x: x.split(" "), binary = True)
labels = vectorizer.fit_transform(combined_data['tags'])

In [None]:
labels.shape

In [None]:
col_sum = labels.sum(axis = 0).A1   
col_sum

In [None]:
sorted_tags = np.argsort(-col_sum)  
sorted_tags

In [None]:
def top_n_tags(n):
    multilabel_yn = labels[:,sorted_tags[:n]] 
    return multilabel_yn

def questionsCovered(n):
    multilabel_yn = top_n_tags(n)
    NonZeroQuestions = multilabel_yn.sum(axis = 1)  
    return np.count_nonzero(NonZeroQuestions), NonZeroQuestions

In [None]:
questionsExplained = []
numberOfTags = []
for i in range(500,4268,500):
    questionsExplained.append(round((questionsCovered(i)[0]/labels.shape[0])*100,2))
    numberOfTags.append(i)
    
plt.figure(figsize = (16, 8))
plt.plot(numberOfTags, questionsExplained)
plt.title("Number of Tags VS Percentage of Questions Explained(%)", fontsize=20)
plt.xlabel("Number of Tags", fontsize=15)
plt.ylabel("Percentage of Questions Explained(%)", fontsize=15)
plt.scatter(x = numberOfTags, y = questionsExplained, c = "blue", s = 70)
for x, y in zip(numberOfTags, questionsExplained):
    # The label is passed positionally because the s= keyword was renamed to text= in matplotlib 3.3
    plt.annotate('({},{}%)'.format(x, y), xy = (x, y), fontweight='bold', fontsize = 12, xytext=(x+70, y-0.3), rotation = -20)

The following important conclusions can be drawn:
1. 90% of the questions are covered if we take the top 500 labels
2. 96% of the questions are covered if we take the top 1000 labels

Since we have limited computation power, it makes sense to keep only the top 500 labels and train the multilabel classifier on those.
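The coverage figure is the share of questions that keep at least one tag once only the n most frequent tags are retained. A dependency-free sketch of that computation, on invented tag strings:

```python
from collections import Counter


def coverage_by_top_n(tag_strings, n):
    """Percentage of questions with at least one tag among the n most frequent."""
    frequency = Counter(tag for tags in tag_strings for tag in set(tags.split()))
    top_tags = {tag for tag, _ in frequency.most_common(n)}
    covered = sum(1 for tags in tag_strings if top_tags & set(tags.split()))
    return 100.0 * covered / len(tag_strings)


sample_tags = ["usa visas", "visas", "baking bread", "bread", "pid"]
for n in (1, 2, 5):
    print(n, coverage_by_top_n(sample_tags, n))
```

Questions whose tags all fall outside the retained set have no positive label left, which is why the notebook drops them before training.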

In [None]:
sumOfRows = questionsCovered(500)[1]
RowIndicesZero = np.where(sumOfRows == 0)[0]  #this contains indices of all the questions for which the tags are removed
data_new = combined_data.drop(labels = RowIndicesZero, axis = 0)
data_new.reset_index(drop = True, inplace = True)
print("Size of new data = ",data_new.shape[0])

In [None]:
#removing tags from data
data_tags = top_n_tags(500)
df = pd.DataFrame(data_tags.toarray())
TagsDF_new = df.drop(labels = RowIndicesZero, axis = 0)
TagsDF_new.reset_index(drop = True, inplace = True)
print("Size of new data = ",TagsDF_new.shape[0])

After keeping only the top 500 labels we still cover 78276 of the 86968 questions (90%).

# Train and Validation Split

We now split the dataset into a training set and a validation set, and then extract TF-IDF features so that the data can be fed to the machine learning model.

In [None]:
allTags = sparse.csr_matrix(TagsDF_new.values)
x_train, x_test, y_train, y_test = train_test_split(data_new, allTags, test_size=0.20, random_state=42)


In [None]:
print("Train data shape ",x_train.shape)
print("Train label shape", y_train.shape)
print("Test data shape ",x_test.shape)
print("Test label shape", y_test.shape)

# Feature Extraction - Using TF-IDF

To train the multilabel classification model we first need to turn the text into numeric features that the classification algorithm can consume.

We will use TF-IDF to extract these features. The vectorizer is fit on the training split only and then applied to the validation split.
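For intuition, TF-IDF can be computed by hand on a toy corpus. The sketch below uses the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The corpus and the function name are assumptions, and n-grams and `max_features` are left out.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows) with L2-normalised smoothed TF-IDF weights."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = [term_freq[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


corpus = ["wave particl dualiti", "wave energi", "bread yeast"]
vocabulary, rows = tfidf(corpus)
for token, weight in zip(vocabulary, rows[1]):
    print(f"{token:10s} {weight:.3f}")
```

In the second document the term that appears in only one document ("energi") receives a larger weight than the term shared with another document ("wave"), which is the effect TF-IDF is designed to produce.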



In [None]:
vectorizer = TfidfVectorizer(max_features=10000, ngram_range = (1,3), tokenizer = lambda x: x.split(" "))
TrainData = vectorizer.fit_transform(x_train['text'])
TestData = vectorizer.transform(x_test['text'])


In [None]:
#sparse.save_npz("FinalTrain.npz", TrainData)       ####### Saving Training and Test data in sparse format for later use  
#sparse.save_npz("FinalTest.npz", TestData)
#sparse.save_npz("FinalTrainLabels.npz", y_train)
#sparse.save_npz("FinalTestLabels.npz", y_test)

In [None]:
#FinalTrain = sparse.load_npz("FinalTrain.npz")      ####### Loading Training and Test Data for training
#FinalTest = sparse.load_npz("FinalTest.npz")
#FinalTrainLabels = sparse.load_npz("FinalTrainLabels.npz")
#FinalTestLabels = sparse.load_npz("FinalTestLabels.npz")

FinalTrain=TrainData
FinalTest=TestData
FinalTrainLabels=y_train
FinalTestLabels=y_test

print("Dimension of train data = ",TrainData.shape)
print("Dimension of test data = ",TestData.shape)
print("Dimension of train labels ",y_train.shape)
print("Dimension of Test labels ", y_test.shape)

# Model Training - Logistic Regression

Now that we have split the dataset into training and validation sets and extracted TF-IDF features, we are ready to train the model.

Since this is a multilabel classification problem, we will use logistic regression with a one-vs-rest strategy.
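The one-vs-rest strategy fits one binary classifier per label and stacks their decisions into an indicator matrix. A toy sketch, assuming scikit-learn and NumPy are installed; the feature matrix and labels are synthetic, and the default solver is used for brevity instead of the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: label 0 depends on feature 0, label 1 on feature 1.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
Y = np.column_stack([(X[:, 0] > 0.5), (X[:, 1] > 0.5)]).astype(int)

model = OneVsRestClassifier(LogisticRegression())
model.fit(X, Y)

predicted = model.predict(X[:5])            # one 0/1 column per label
probabilities = model.predict_proba(X[:5])  # per-label probabilities
print(predicted)
print(probabilities.round(2))
```

With indicator-matrix targets the per-label probabilities are not forced to sum to one across a row, so a question can receive several tags or none.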

In [None]:
classifier= OneVsRestClassifier(LogisticRegression(C=0.9,penalty='l1',solver='saga'), n_jobs=-1)
classifier.fit(FinalTrain, FinalTrainLabels)
predictions = classifier.predict(FinalTest)

In [None]:
prediction_train=classifier.predict(FinalTrain)

In [None]:
print("Train Accuracy :",accuracy_score(FinalTrainLabels,prediction_train))
print("Train Macro f1 score :",f1_score(FinalTrainLabels, prediction_train, average = 'macro'))
print("Train Micro f1 score :",f1_score(FinalTrainLabels, prediction_train, average = 'micro'))
print("Train Classification Report :\n",classification_report(FinalTrainLabels, prediction_train))


In [None]:
print("Validation Accuracy :",accuracy_score(FinalTestLabels,predictions))
print("Validation Macro f1 score :",f1_score(FinalTestLabels, predictions, average = 'macro'))
print("Validation Micro f1 score :",f1_score(FinalTestLabels, predictions, average = 'micro'))
print("Validation Classification Report :\n",classification_report(FinalTestLabels, predictions))
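Micro and macro F1 can diverge sharply when many rare tags are never predicted, so it helps to see how each is assembled. The sketch below computes both from indicator rows in plain Python; the tiny label matrices are invented, and labels with no true or predicted positives are scored as 0, which is an assumption about how to treat that edge case.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for lists of 0/1 indicator rows."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(*(sum(c[i] for c in per_label) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in per_label) / n_labels
    return micro, macro


y_true = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]  # rare labels are missed
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

Micro F1 is dominated by the frequent label, while macro F1 is pulled down by every label the model never predicts.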


In [None]:
################## Save Model for later use #################################################
##filename = 'best_model_l1_saga_f1_0.47.sav'
#joblib.dump(classifier, filename)

# Model Training - Naive Bayes


In [None]:
classifier_1= OneVsRestClassifier(MultinomialNB(alpha=0.35), n_jobs=-1)
classifier_1.fit(FinalTrain, FinalTrainLabels)
predictions_1 = classifier_1.predict(FinalTest)

In [None]:
prediction_train_1=classifier_1.predict(FinalTrain)

In [None]:
print("Train Accuracy :",accuracy_score(FinalTrainLabels,prediction_train_1))
print("Train Macro f1 score :",f1_score(FinalTrainLabels, prediction_train_1, average = 'macro'))
print("Train Micro f1 score :",f1_score(FinalTrainLabels, prediction_train_1, average = 'micro'))
print("Train Classification Report :\n",classification_report(FinalTrainLabels, prediction_train_1))

In [None]:
print("Validation Accuracy :",accuracy_score(FinalTestLabels,predictions_1))
print("Validation Macro f1 score :",f1_score(FinalTestLabels, predictions_1, average = 'macro'))
print("Validation Micro f1 score :",f1_score(FinalTestLabels, predictions_1, average = 'micro'))
print("Validation Classification Report :\n",classification_report(FinalTestLabels, predictions_1))

# Generating predictions on Test Set

In [None]:
test_data.head()

In [None]:
test_data['title']=test_data['title'].apply(lambda x: remove_html_tags(x))
test_data['content']=test_data['content'].apply(lambda x: remove_html_tags(x))

In [None]:
test_data['title']=test_data['title'].apply(lambda x: remove_space(x))
test_data['content']=test_data['content'].apply(lambda x: remove_space(x))


In [None]:
test_data['title_words']=test_data['title'].apply(lambda x: tokenize(x))

test_data['content_words']=test_data['content'].apply(lambda x: tokenize(x))

In [None]:
test_data['title_words']=test_data['title_words'].apply(lambda x: remove_punctuation(x))
test_data['content_words']=test_data['content_words'].apply(lambda x: remove_punctuation(x))

In [None]:
test_data['title_words']=test_data['title_words'].apply(lambda x: lower_word(x))
test_data['content_words']=test_data['content_words'].apply(lambda x: lower_word(x))

In [None]:
test_data['title_words']=test_data['title_words'].apply(lambda x: remove_stopwords(x))
test_data['content_words']=test_data['content_words'].apply(lambda x: remove_stopwords(x))


In [None]:
test_data.reset_index(drop=True,inplace=True)
test_data['text']=test_data['title_words']+ test_data['content_words']


In [None]:
test_data['text']=test_data['text'].apply(lambda x: lemmatization(x))


In [None]:
test_data.head()

In [None]:
test_data['text']=test_data['text'].apply(lambda x: ' '.join(x))
#test_data.to_csv('test_data_preprocessed.csv',index=False)

In [None]:
test_data.head()

In [None]:
test_data.to_csv('test_data_preprocessed.csv',index=False)

In [None]:
#test_data['title_words']=test_data['title_words'].apply(lambda x: ast.literal_eval(x))
#test_data['content_words']=test_data['content_words'].apply(lambda x: ast.literal_eval(x))

In [None]:
#test_data['text']=test_data['title_words']*3 + test_data['content_words']

In [None]:
test_data.text.head()

In [None]:
test_data.columns

In [None]:
len(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
#vectorizer = TfidfVectorizer(max_features=50000, ngram_range = (1,3), tokenizer = lambda x: x.split(" "))
#TrainData = vectorizer.fit_transform(x_train['text'])
test_data_features = vectorizer.transform(test_data['text'])   ##### We will use the vectoriser which we fit on training data

In [None]:
test_data_features.shape

In [None]:
predictions_test=classifier.predict(test_data_features)

In [None]:
predictions_test.shape

In [None]:
predictions_test_df=pd.DataFrame(predictions_test.toarray())

In [None]:
predictions_test_df.head()

# Get Column Names 

In [None]:
vectorizer_label = CountVectorizer(tokenizer = lambda x: x.split(" "), binary = True)
new_labels = vectorizer_label.fit_transform(combined_data['tags'])

In [None]:
len(vectorizer_label.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
top_label_indices=sorted_tags[0:500]

In [None]:
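# Fetch the feature names once (get_feature_names() was removed in scikit-learn 1.2)
label_names=vectorizer_label.get_feature_names_out()
top_500=[label_names[i] for i in top_label_indices]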


# Predictions Probability

In [None]:
predictions_probability=classifier.predict_proba(test_data_features)

In [None]:
predictions_probability=pd.DataFrame(predictions_probability)

In [None]:
predictions_probability.columns=top_500

In [None]:
predictions_probability.head()

In [None]:
for c in predictions_probability.columns.values.tolist():
    predictions_probability[c]=np.where(predictions_probability[c] >= 0.03,1,0)
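Thresholding turns each row of per-tag probabilities into a set of predicted tags. The sketch below applies a cutoff and, as an added assumption not present in the notebook, falls back to the single most probable tag when nothing clears the cutoff so that no row is left empty. Tag names and probabilities are invented.

```python
def probabilities_to_tags(row, tag_names, threshold=0.03):
    """Join the tags whose probability is >= threshold into one string."""
    chosen = [tag for tag, p in zip(tag_names, row) if p >= threshold]
    if not chosen:  # fallback (assumption): keep the best-scoring tag
        best = max(range(len(row)), key=lambda i: row[i])
        chosen = [tag_names[best]]
    return " ".join(chosen)


tag_names = ["electrical", "visas", "baking"]
print(probabilities_to_tags([0.40, 0.01, 0.05], tag_names))   # electrical baking
print(probabilities_to_tags([0.01, 0.02, 0.005], tag_names))  # visas
```

A cutoff as low as 0.03 favours recall over precision: more tags are emitted per question, at the cost of more false positives.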

# Prepare test data for Submission

In [None]:
cols_test = predictions_probability.columns


In [None]:
bt = predictions_probability.apply(lambda x: x > 0)


In [None]:
bt.head()

In [None]:
result=bt.apply(lambda x: list(cols_test[x.values]), axis=1)

In [None]:
result=pd.DataFrame(result)

In [None]:
result.columns=['tag']

In [None]:
result['tag']=result['tag'].apply(lambda x: ' '.join(x))

In [None]:
test_data.reset_index(drop=True,inplace=True)
result.reset_index(drop=True,inplace=True)
final_result=pd.concat([test_data,result],axis=1)

In [None]:
final_result.tag.unique()

In [None]:
final_result.loc[final_result['tag']=="electrical"]

In [None]:
final_result.head()

In [None]:
submission=final_result[['id','tag']]

In [None]:
submission.columns=['id','tags']

In [None]:
submission.head()

In [None]:
submission.to_csv('twelth_submission.csv',index=False)

# Conclusion

Based on the above analysis, the model performs satisfactorily on the training and validation sets, but its performance on the test set is very low.

1. Train Set: 0.50 (Micro F1 score)
2. Validation Set: 0.48 (Micro F1 score)
3. Test Set (Kaggle Submission): 0.011 (Micro F1 score)

The main reason for the low F1 score on the test set is probably that the domain and vocabulary of the test set are very different from those the model was trained on.
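One way to probe this explanation is to measure how much of the test vocabulary the training vocabulary actually covers. The sketch below computes an out-of-vocabulary rate over tokens; the two miniature corpora are invented, so the printed number says nothing about the real data.

```python
def out_of_vocabulary_rate(train_texts, test_texts):
    """Fraction of test tokens that never occur in the training texts."""
    train_vocabulary = {token for text in train_texts for token in text.split()}
    test_tokens = [token for text in test_texts for token in text.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in train_vocabulary)
    return unseen / len(test_tokens)


train_texts = ["bake bread yeast", "motor speed control", "visa travel time"]
test_texts = ["photon speed vacuum", "quantum spin time"]
print(out_of_vocabulary_rate(train_texts, test_texts))
```

The same mismatch applies to the labels: a classifier restricted to tags seen in training cannot emit a physics tag that never occurred there.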

Although the score is low, more extensive feature engineering and hyperparameter tuning may improve the model's performance on the test set.

# Next Steps

So far we have used TF-IDF for feature extraction and logistic regression as the classifier.

In the future, the model's performance could be improved by trying other feature extraction and model training techniques.

Feature Extraction Techniques

1. Increase the vocabulary size of the TF-IDF feature extractor
2. Extract POS tags from the text and use them as features to train the model
3. Perform LDA and use the topic vectors as features for the multilabel classification model
4. Train word2vec and doc2vec models on the data and use them to extract features
5. Extract features using pretrained models such as Universal Sentence Encoder (USE) and BERT
6. Apply dimensionality reduction techniques such as PCA
7. During EDA we observed that the content overlaps with the tags more than the title does, so we could give more weight to content words than to title words when training the model
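A simple way to weight one field more heavily in a bag-of-words model is to repeat its tokens before vectorisation, since term frequency then scales with the repeat count. A sketch, with hypothetical weights and token lists:

```python
def weighted_text(title_tokens, content_tokens, title_weight=1, content_weight=2):
    """Repeat each field's tokens by an integer weight and join into one string."""
    return " ".join(title_tokens * title_weight + content_tokens * content_weight)


print(weighted_text(["speed", "light"], ["mirror", "laser"]))
# speed light mirror laser mirror laser
```

When n-grams are enabled, repetition also creates artificial n-grams across the repeat boundary (here "laser mirror"), so this trick is cleanest with unigram features.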

Model Training techniques

1. Tree based models like Random Forest and XGBoost
2. Neural Network Models like CNN, RNN and DNN
3. Support Vector Machine (SVM)


Important Note: In our multilabel setup we currently assume that all labels are independent of each other. This assumption needs to be validated, and the techniques adjusted accordingly.
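One rough check of the independence assumption is to compare how often two tags co-occur with how often they would co-occur if they were independent. The sketch below computes that ratio (often called lift) for every observed pair; values far from 1 suggest dependence. The tag strings are invented.

```python
from collections import Counter
from itertools import combinations


def pair_lift(tag_strings):
    """Return {(tag_a, tag_b): P(a and b) / (P(a) * P(b))} for observed pairs."""
    n = len(tag_strings)
    tag_sets = [sorted(set(tags.split())) for tags in tag_strings]
    single = Counter(tag for tags in tag_sets for tag in tags)
    pairs = Counter(pair for tags in tag_sets for pair in combinations(tags, 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pairs.items()
    }


sample_tags = ["usa visas", "usa visas", "bread yeast", "bread", "usa"]
lifts = pair_lift(sample_tags)
for pair, lift in sorted(lifts.items()):
    print(pair, round(lift, 2))
```

If many frequent pairs show a lift well above 1, approaches that model label dependence, for example classifier chains, may be worth trying in place of plain one-vs-rest.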
