# Problem Statement

The aim of this exercise is to classify websites based on their content. The dataset contains a list of websites, their category, and a text extract from each.

We use `Natural Language Processing (NLP)` techniques to build features from our dataset and then train a Naive Bayes model on them to classify the websites.


## Methodology

1. Data exploring and understanding: This step is a preliminary look at the data to understand its structure and contents.

2. Data Cleaning: This step cleans the existing data. We check for missing values and treat them as required. We also look for constant-value columns, since they add nothing to our analysis. Columns dominated by a single value often add little as well, so removing them helps with further analysis.

3. Data Preparation: This step prepares the data for feeding into the model. It involves steps such as creating dummy variables and scaling, depending on the data type. Here, we use `NLP` techniques to build a dataset that can be fed to the model.

4. Data Visualization: This step involves visualizing the dataset and checking for relationships among the independent variables. We can also drop some feature columns here, but not aggressively.

5. Train-test split: This step involves splitting the dataset into train and test parts.

6. Model Development-validation and evaluation: This step involves training the model, validating it, and evaluating it using relevant metrics.

7. Conclusion/Recommendation: This step involves drawing conclusions and making recommendations to the business.

### Importing Dependencies

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB

import nltk
# nltk.download()  # uncomment on first run to fetch the tokenizer and POS tagger data
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')

### Reading Dataset

In [None]:
import os 
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames: 
        print(os.path.join(dirname, filename))

In [None]:
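# Reads the last file found by the walk above, so this assumes the
# input directory contains only the dataset CSV.
data = pd.read_csv(os.path.join(dirname, filename))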
data.head()

The dataset has 4 columns, including the target variable `Category`.

## 1. Data exploring and understanding

In [None]:
data.shape

The data has 1408 entries.
The first column is only a serial number, so we drop it.

In [None]:
data = data.drop('Unnamed: 0', axis=1)
data.head()

In [None]:
data.info()

The dataset has no null entries, so no missing-value treatment is needed.

In [None]:
# website_url

data.website_url.nunique()

The `website_url` column has 1384 unique entries, which means the dataset contains some duplicate URLs. We drop those duplicates.

## 2. Data Cleaning

In [None]:
# Dropping duplicate entries

data = data.drop_duplicates(subset='website_url').reset_index(drop=True)
data.shape

In [None]:
# Category

data.Category.value_counts(dropna=False)

The `Category` column has 16 unique values, so this is a multi-class classification problem. We use the `MultinomialNB` algorithm, which is efficient and handles multi-class problems natively.
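
As a rough illustration of why Multinomial Naive Bayes extends naturally to many classes, the sketch below scores each class by its log prior plus the summed log likelihoods of the document's tokens, using Laplace smoothing. It is a from-scratch toy, not scikit-learn's implementation; the function names, documents, and category labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: tokens seen in the class plus alpha per vocabulary term.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical three-class toy data.
docs = [
    ["shop", "cart", "price"], ["cart", "delivery", "shop"],
    ["court", "law", "judge"], ["law", "government", "court"],
    ["match", "score", "team"], ["team", "league", "score"],
]
labels = ["E-Commerce", "E-Commerce", "Law and Government",
          "Law and Government", "Sports", "Sports"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
pred = predict(["law", "court", "price"], log_prior, log_likelihood)
print(pred)  # Law and Government
```

Adding a class only adds one more entry to the score dictionary, so the cost grows linearly with the number of classes. Note that this sketch works on raw token counts, whereas the notebook feeds TF-IDF weights to `MultinomialNB`, which also accepts such fractional values.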

In [None]:
# cleaned_website_text

data.cleaned_website_text

The `cleaned_website_text` column contains the text content of each website. We use this column to build features for our model.

## 3. Data Preparation

#### Feature matrix development method:

We extract words from the `cleaned_website_text` column and build a vocabulary from them. From that vocabulary we remove English stop words, which generally carry little information about a document. From the remaining words we choose the top `n` most frequent ones and use them as features.

The next step is to assign a value to each feature for every data entry. For this we use the `TF-IDF` method.

`TF-IDF:` TF-IDF measures how relevant a term is to a particular document (here, a website) within a collection of documents.

$TF\text{-}IDF(t,d,D) = TF(t,d) \cdot IDF(t,D)$

where

$TF(t,d) = \dfrac{\text{count of } t \text{ in } d}{\text{number of words in } d}$

$IDF(t,D) = \log\left(\dfrac{|D|}{df + 1}\right)$

$t$ --> term of interest

$d$ --> document of interest

$D$ --> set of documents, with $|D|$ the number of documents

$df$ --> number of documents in $D$ that contain $t$

To summarize, the TF-IDF score of a term for a given document is high when the term makes up a large share of that document and/or appears in few other documents.
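
The formulas above can be checked by hand with a short sketch. The code below implements them literally on a hypothetical toy corpus in which each "website" is a list of tokens; the function names and the data are assumptions made for illustration.

```python
import math

def tf(term, doc_tokens):
    """Share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(N / (df + 1)), where df is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus of four tokenized documents.
corpus = [
    ["shop", "cart", "price", "shop"],
    ["court", "law", "judge"],
    ["news", "law", "sport"],
    ["news", "sport", "score"],
]

rare = tf_idf("shop", corpus[0], corpus)    # "shop" appears in 1 of 4 documents
common = tf_idf("law", corpus[1], corpus)   # "law" appears in 2 of 4 documents
print(round(rare, 4))    # 0.3466
print(round(common, 4))  # 0.0959
```

The rarer, more concentrated term gets the higher score. scikit-learn's `TfidfVectorizer` uses a smoothed IDF variant and normalizes each row by default, so its values will not match this sketch exactly.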


Additional Reading Source: 
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

Before building the TF-IDF matrix, we filter out words whose `Parts Of Speech (POS)` tag is anything other than `NOUN`. The reasoning is that nouns represent website content better than other `POS` categories.

We use the `NLTK` library to tokenize the text of each website and assign a `POS` tag to each word. We then keep only the `NOUN`-tagged words and join them back into a single string per website. The filtered text no longer reads as natural language, but it can improve the performance of our model.
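
To make the filtering step concrete, here is a minimal sketch that operates on already-tagged tokens. The `(word, tag)` pairs mimic the shape returned by `nltk.pos_tag(..., tagset='universal')`, but they are written by hand for illustration and are not actual tagger output; `keep_nouns` is a hypothetical helper name.

```python
def keep_nouns(tagged_tokens):
    """Join the words whose universal POS tag is NOUN back into one string."""
    return " ".join(word for word, tag in tagged_tokens if tag == "NOUN")

# Hand-written (word, tag) pairs; the tags are illustrative assumptions.
tagged = [
    ("online", "ADJ"), ("store", "NOUN"), ("sells", "VERB"),
    ("cheap", "ADJ"), ("laptops", "NOUN"), ("and", "CONJ"),
    ("accessories", "NOUN"),
]

filtered = keep_nouns(tagged)
print(filtered)  # store laptops accessories
```

Running the real tagger additionally requires the NLTK tokenizer and tagger data to be downloaded; the exact resource names depend on the NLTK version.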

In [None]:
# filtering noun words from corpus

filtered_words = []
for text in data.cleaned_website_text:
    # tag every token, then keep only the words tagged as NOUN
    tagged = nltk.pos_tag(word_tokenize(text), tagset='universal')
    filtered_words.append(' '.join(word for word, tag in tagged if tag == 'NOUN'))

In [None]:
# adding filtered_words to our dataframe

data['filtered_words'] = filtered_words
data.head()

Let's build the TF-IDF matrix from the `filtered_words` column of our dataset. We use the following hyperparameters.

- sublinear_tf: Replaces the raw term frequency `tf` with `1 + log(tf)`, which dampens the effect of terms that repeat many times within one document.

- min_df: When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. This value is also called the cut-off in the literature. We use 5 as the minimum document frequency.

- stop_words: We remove any English stop words that remain in the filtered words. Stop words add little information to the model.

- max_features: Maximum number of features to include in the vocabulary. We use 1500 features.

Note: Naive Bayes can handle a number of features that is large relative to the number of samples.
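
The effect of these hyperparameters is easier to see on a tiny corpus. The sketch below uses hypothetical documents and much smaller thresholds than the notebook (`min_df=2`, `max_features=2`) so that the pruning is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the thresholds are scaled down for illustration.
docs = [
    "shop shop cart price",
    "shop shop cart delivery",
    "shop law law court",
    "law law court the judge",
]

vec = TfidfVectorizer(min_df=2, stop_words="english", max_features=2)
matrix = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)         # ['law', 'shop']
print(matrix.shape)  # (4, 2)
```

Here `the` is removed as a stop word, `price`, `delivery`, and `judge` are dropped because they occur in only one document, and `max_features=2` then keeps the two remaining terms with the highest total count across the corpus (`shop` and `law`), discarding `cart` and `court`.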

In [None]:
# Vectorized form of tfidf 

tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        stop_words = 'english',
                        max_features = 1500)
feat = tfidf.fit_transform(data.filtered_words)

In [None]:
# feature/term names (get_feature_names_out requires scikit-learn >= 1.0)

tfidf.get_feature_names_out()

In [None]:
# Converting tfidfvector to dataframe

X = pd.DataFrame(feat.toarray(), columns=tfidf.get_feature_names_out())
X.head()

In [None]:
# checking range for values in X

X.describe()

In [None]:
# encoding labels to derive y

le = preprocessing.LabelEncoder()
y = le.fit_transform(data.Category)
y
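
Because later cells refer to classes by their integer codes, it helps to know how those codes are assigned. `LabelEncoder` numbers the distinct labels in sorted order; the plain-Python sketch below reproduces that mapping on hypothetical category names (`fit_label_mapping` is an illustrative helper, not part of scikit-learn).

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}

# Hypothetical category names.
labels = ["Travel", "E-Commerce", "Law and Government", "E-Commerce"]

mapping = fit_label_mapping(labels)
encoded = [mapping[name] for name in labels]
print(mapping)  # {'E-Commerce': 0, 'Law and Government': 1, 'Travel': 2}
print(encoded)  # [2, 0, 1, 0]
```

In the notebook itself, `le.classes_` lists the category names in code order, and `le.inverse_transform` converts integer codes back to names.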

## 4. Data Visualization

In [None]:
df_f = pd.concat([X, data.Category], axis=1)
df_f.head()

In [None]:
plt.figure(figsize=(10, 15))
cols = df_f.columns[:5]  # first five TF-IDF terms
for i, col in enumerate(cols):
    plt.subplot(len(cols) // 2 + 1, 2, i + 1)
    # mean TF-IDF weight of the term within each category
    df_f.groupby('Category')[col].mean().plot(kind='barh', title=col)
plt.tight_layout()

Observations:

- The term `accessories` is associated mainly with `E-Commerce`.
- The term `accessibility` is associated mainly with `Law and Government`.

## 5. Train-test split

In [None]:
# train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                     train_size = 0.7, 
                                     test_size = 0.3, 
                                     random_state = 100,
                                     stratify = y)
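
The `stratify` argument keeps the class proportions the same in both parts of the split, which matters when some categories are rare. A minimal sketch on hypothetical imbalanced labels (70 samples of one class, 30 of another) shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1.
y = [0] * 70 + [1] * 30
X = [[i] for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)

train_counts = Counter(y_train)  # 49 samples of class 0, 21 of class 1
test_counts = Counter(y_test)    # 21 samples of class 0, 9 of class 1
print(train_counts, test_counts)
```

Both parts preserve the original 70/30 ratio. Without `stratify`, a rare class could end up under-represented in, or missing from, the test set.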

## 6. Model Development-validation and evaluation

In [None]:
# Model development 

model = MultinomialNB()
model.fit(X_train, y_train)

In [None]:
# Making prediction and evaluation

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Accuracy check

print('Train accuracy:', round(accuracy_score(y_train, y_train_pred),2))
print('Test accuracy:', round(accuracy_score(y_test, y_test_pred),2))

#### Confusion Matrix

In [None]:
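plt.subplots(figsize=(12,8))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           annot=True,
           fmt='d',  # show integer counts instead of scientific notation
           cmap='YlGnBu')
plt.title('Confusion matrix (train)')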

In [None]:
plt.subplots(figsize=(12,8))
sns.heatmap(confusion_matrix(y_test, y_test_pred),
           annot=True,
           fmt='d',  # show integer counts instead of scientific notation
           cmap='YlGnBu')
plt.title('Confusion matrix (test)')

The confusion matrices indicate that:

- The model fails to predict categories 0 and 6 correctly on both the training and the test set. A likely reason is that these categories have fewer samples.
- The model does not appear to be overfitting, which is a good sign.
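
Reading weak classes off a heatmap can be backed up with per-class recall, which is the diagonal count divided by the row total. The sketch below uses a hypothetical 3x3 matrix, not the notebook's actual results, with rows as true classes and columns as predicted classes (the layout `confusion_matrix` returns).

```python
def per_class_recall(cm):
    """Recall per class: diagonal count divided by the row total (true samples)."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = [
    [0, 3, 1],    # class 0 is never predicted correctly
    [0, 18, 2],
    [0, 1, 19],
]

recalls = per_class_recall(cm)
print([round(r, 2) for r in recalls])  # [0.0, 0.9, 0.95]
```

A recall of 0 for a class corresponds to an empty diagonal cell in the heatmap. For the notebook's data, `sklearn.metrics.classification_report` gives the same per-class view together with precision and F1.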

## 7. Conclusion/Recommendation

- NLP techniques can be used for text processing and feature development in ML model building.
- The Multinomial Naive Bayes algorithm can perform multi-class classification with a large number of classes.
- Naive Bayes in general handles a large number of features relative to the data size quite easily.
- For future work, we can collect more data for the less frequent classes and thereby improve the model further.

-----------------