**Key Challenge :** 

Some fields, such as "name" and "item description", contain substantial free text. Treating them as simple categorical features and label-encoding them is unlikely to give satisfactory results when predicting the price of a Mercari product, so we need to apply more advanced text processing.

**Key Takeaways from this Kernel :**

* Performance of Supervised Predictive Model without/with using Advanced Text Processing 
* Compare and See the Difference! **(for above)**
* Concept of CountVectorizer, TfidfVectorizer, LabelBinarizer
* Concept of Sparse Matrices
* Concept of Topic Modelling and Latent Dirichlet Allocation (Unsupervised method)
* Using **"pyLDAvis"** for visualizing topics in LDA Topic Model

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function() {
    return false;
}

In [None]:
!apt-get install p7zip

In [None]:
!p7zip -d '../input/mercari-price-suggestion-challenge/train.tsv.7z'

Let us import training and testing data sets.

In [None]:
import pandas as pd
train_data = pd.read_csv("train.tsv", sep='\t') 
test_data = pd.read_csv("../input/mercari-price-suggestion-challenge/test_stg2.tsv.zip" , sep='\t')

Now, let us have a look at the data and data types.

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
test_data.head()

In [None]:
test_data.info()

We create a copy of the training data set and work with this copy first. Later we will use the original training data set to apply advanced text pre-processing.

In [None]:
train_copy = train_data.copy()

Let us free up some memory to avoid getting "memory exceeded" warning.

In [None]:
import gc
gc.collect()

### Data Pre-Processing :

We will split category name into three parts: (1) Main Category, (2) First Sub-Category, (3) Second Sub-Category. Whenever blanks are found, they will be replaced as "No Label" for these three. Then we will apply label encoding on them.

In [None]:
# Splitting category name
def split_cat(text):
    # Missing categories are NaN (a float), which has no .split() method
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")

In [None]:
# Splitting category name into: Main Category, SubCategory_1, SubCategory_2
train_copy['main_category'], train_copy['subcat_1'], train_copy['subcat_2'] = zip(*train_copy['category_name'].apply(lambda x: split_cat(x)))

In [None]:
# Label Encoding
from sklearn import preprocessing
def toNumeric(data,to):
    if train_copy[data].dtype == object:
        le = preprocessing.LabelEncoder()
        train_copy[to] = le.fit_transform(train_copy[data].astype(str))   
toNumeric('name','n_name')
toNumeric('category_name','n_category_name')
toNumeric('brand_name','n_brand_name')
toNumeric('main_category','n_main_category')
toNumeric('subcat_1','n_subcat_1')
toNumeric('subcat_2','n_subcat_2')
train_copy.head()

### Data Cleaning :
We will apply basic data cleaning like: filling up missing data, dropping NA etc.

In [None]:
#Checking for NULL values in the columns
train_copy.isnull().any()

category_name, brand_name and item_description have null values, so we will fill in the missing data for these columns.

In [None]:
def fill_missing_data(data):
    # Assign the result back instead of relying on inplace=True on a column view
    data["category_name"] = data["category_name"].fillna("Other/Other/Other")
    data["brand_name"] = data["brand_name"].fillna("Unknown brand")
    data["item_description"] = data["item_description"].fillna("No description")
    return data

In [None]:
import numpy as np
train_copy = fill_missing_data(train_copy)
train_copy = train_copy.dropna()
print(np.shape(train_copy))
train_copy.head()

main_category, subcat_1, subcat_2 are now segregated. Also we have got separate columns for corresponding label encoded values of those parameters.

### Exploratory Data Analysis :

We will proceed with doing some EDA now to explore some interesting findings.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='notebook')
sns.set_style("whitegrid", {'axes.grid' : False})
plt.tight_layout()

#### [ Sale Price ]

In [None]:
print("Range of price : ")
print("Minimum Price: ",'$', train_copy["price"].min())
print("Maximum Price: ",'$', train_copy["price"].max())
fig, ax = plt.subplots(3, 1, figsize = (13, 16))
ax[0].hist(train_copy.price, bins = 100, range = [min(train_copy.price), max(train_copy.price)+100], label = "price", color='red', alpha=0.7)
ax[0].annotate(' Outliers\n present\n till\n this point', xy=(max(train_copy.price), 100), xytext=(max(train_copy.price), 125000), arrowprops=dict(facecolor='black'), color='black')
ax[0].set_title("Histogram of Price Distribution", fontsize = 13)
ax[0].set_xlabel("Price", fontsize = 10)
ax[0].set_ylabel("Frequency ", fontsize = 10)

ax[1].set_title("Histogram of Price Distribution (Focused Mode)", fontsize = 13)
ax[1].hist(train_copy.price, bins = 100, range = [0, 200], label = "price", color='red', alpha=0.7)
ax[1].set_xlabel("Price", fontsize = 10)
ax[1].set_ylabel("Frequency ", fontsize = 10)


sns.boxplot(x = train_copy.price, showfliers = False, ax = ax[2], linewidth=0.7, color='red')
ax[2].set_title("Box Plot for Price Distribution", fontsize = 13)
ax[2].set_xlabel("Price", fontsize = 10)
plt.show()

The price distribution is strongly right-skewed and does not resemble a normal distribution. The range of outliers for price is very wide.

#### [ Brand Name ]

In [None]:
brands = train_copy["brand_name"].value_counts()
print("No. of Unique Brand Names :", brands.size)
fig, ax = plt.subplots(1, 2, figsize = (13, 6))
# we skipped '0' index and started from 1st because 0th index has "unknown brands"
sns.barplot(x = brands[1:11].values, y = brands[1:11].index, ax = ax[0], edgecolor='k', linewidth=0.5, palette='rocket')
ax[0].set_title("Top 10 Most Frequently Used Brand Names", fontsize = 13)
ax[0].set_xlabel("Counts", fontsize = 10)
ax[0].set_ylabel("Brand Name", fontsize = 10)

# Mean price per brand, keeping the ten most expensive brands
top10_brands = train_copy.groupby('brand_name')['price'].mean()
df_expPrice = top10_brands.sort_values(ascending = False)[0:10].reset_index()
sns.barplot(x="brand_name", y="price", data=df_expPrice, ax = ax[1], edgecolor='k', linewidth=0.5, palette='PuRd')
ax[1].set_title("Top 10 Most Costly Brands", fontsize = 13)
ax[1].set_xlabel("Brand Name", fontsize = 10)
ax[1].set_ylabel("Sale Price", fontsize = 10)
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=35)
plt.show()

"Victoria's Secret" is the most widely used brand whereas "Demdaco" is the costliest brand in the lot. 

#### [ Item Condition Id ]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (13,6))
sns.countplot(x = train_copy.item_condition_id, ax = ax[0], palette='Blues_r', edgecolor='k', linewidth=0.6)
rectangles = ax[0].patches
ax[0].set_title("Count Distribution of Item Condition Ids", fontsize = 13)
# Label each bar with its own height; value_counts() is sorted by frequency, not by bar order
for rect in rectangles:
    height = rect.get_height()
    ax[0].text(rect.get_x() + rect.get_width()/2, height + 5, int(height), ha = "center", va = "bottom")
ax[1].set_title("Sale Price Distribution of Item Condition Ids", fontsize = 13)    
sns.boxplot(x = train_copy.item_condition_id, y = train_copy.price, showfliers = False, orient = "v", ax = ax[1], hue = train_copy.shipping, palette="Set1", linewidth=0.6)
plt.show()

"1" is the most frequent item_condition_id. Products with item_condition_id "1" and "5" are costlier. Interestingly, within each item_condition_id, items with shipping flag 1 have lower prices than those with shipping flag 0.

#### [ Main Category ]

In [None]:
fig, ax = plt.subplots(2, 1, figsize = (13,18))
sns.countplot(x = train_copy.main_category, ax = ax[0], palette='Reds', edgecolor='k', linewidth=0.6)
ax[0].set_xticklabels(ax[0].get_xticklabels(),rotation=15)
ax[0].set_title("Count Distribution of Main Categories", fontsize = 13)
rectangles = ax[0].patches
# Label each bar with its own height; value_counts() is sorted by frequency, not by bar order
for rect in rectangles:
    height = rect.get_height()
    ax[0].text(rect.get_x() + rect.get_width()/2, height + 5, int(height), ha = "center", va = "bottom")
sns.boxplot(x = train_copy.main_category, y = train_copy.price, showfliers = False, orient = "v", ax = ax[1], hue = train_copy.shipping, palette="Set1", linewidth=0.6)
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=15)
ax[1].set_title("Sale Price Distribution of Main Categories", fontsize = 13)
plt.show()

The "Women", "Beauty" and "Kids" main categories are the most frequent. Products in the "Electronics" and "Men" main categories are pricier than the others. Products with shipping flag 1 are less costly than those with shipping flag 0.

#### [ Item Description ]

In [None]:
#python -m pip install wordcloud
from wordcloud import WordCloud
import os
wordcloud = WordCloud(width = 2400, height = 1200).generate(" ".join(train_copy.item_description.astype(str)))
plt.figure(figsize = (13, 10))
plt.imshow(wordcloud)
plt.show()

"Brand new", "free shipping", "great condition", "good condition", "never worn", "never used", 
"Victoria Secret", "smoke free", "Size large", "Size medium", "Size small", "excellent condition" 
are some frequently appearing item description texts.

### Log-Transformation of Target Variable :

In [None]:
train_copy['price'] = np.log1p(train_copy['price'])

In [None]:
fig, ax = plt.subplots(2, 1, figsize = (13, 12))
ax[0].hist(train_copy.price, bins = 100, range = [min(train_copy.price), max(train_copy.price)+100], label = "price", color='red', alpha=0.7)
ax[0].annotate(' Outliers\n present\n till\n this point', xy=(max(train_copy.price), 100), xytext=(max(train_copy.price), 125000), arrowprops=dict(facecolor='black'), color='black')
ax[0].set_title("Histogram of Log(Price) Distribution", fontsize = 13)
ax[0].set_xlabel("Log(Price)", fontsize = 10)
ax[0].set_ylabel("Frequency ", fontsize = 10)

ax[1].set_title("Histogram of Log(Price) Distribution (Focused Mode)", fontsize = 13)
ax[1].hist(train_copy.price, bins = 100, range = [0, 8], label = "price", color='red', alpha=0.7)
ax[1].set_xlabel("Log(Price)", fontsize = 10)
ax[1].set_ylabel("Frequency ", fontsize = 10)

The log-transformed price distribution is now roughly symmetric and much closer to a normal distribution. We can work with this.
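
As a small illustration of the transform, the sketch below applies `log1p` to a few made-up prices (illustrative values, not taken from the dataset) and maps them back with `expm1`, the same inverse used later when writing the submission. It relies only on the standard `math` module.

```python
import math

# Hypothetical prices; 0 is included to show why log1p is preferred over log
prices = [0.0, 3.0, 10.0, 2000.0]

# log1p(x) = log(1 + x), which is defined at x = 0
log_prices = [math.log1p(p) for p in prices]

# expm1(y) = exp(y) - 1 is the inverse of log1p
recovered = [math.expm1(v) for v in log_prices]

for p, v, r in zip(prices, log_prices, recovered):
    print(f"price={p:8.1f}  log1p={v:.4f}  back={r:.1f}")
```

The large price is compressed far more than the small ones, which is what pulls in the long right tail.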

### Correlation Analysis :

In [None]:
import numpy as np
# Correlations are computed on numeric columns only (numeric_only needs pandas >= 1.5)
corr = train_copy.corr(numeric_only=True)
# np.bool was removed from NumPy; the built-in bool is equivalent here
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,10))
sns.heatmap(corr, mask = mask, vmin = -1, annot = True, fmt='.1g', cmap = 'plasma', edgecolor='w', linewidth=0.6)
plt.suptitle(' Correlations Heat Map for all Attributes', fontsize=13)

'n_category_name' and 'n_main_category' are very highly correlated. Hence we will choose to keep 'n_main_category' in the predictive model discarding 'n_category_name'.

In [None]:
gc.collect()

## Data Modeling
We use three simple regression models here and do not tune the hyperparameters of the boosting models. The main goal of this work is to measure the benefit of advanced text pre-processing, so the models are kept simple. Later we will re-check the performance of the same models (Ridge, LGBM and XGB regressors) after advanced text pre-processing and compare how much they improve.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

r_data= train_copy[['item_condition_id','shipping','n_name','n_brand_name','n_main_category','n_subcat_1','n_subcat_2']]
X_train, x_test, Y_train, y_test = train_test_split(r_data, train_copy['price'], test_size=0.25, random_state=12345)

def run_model(model, X_train, Y_train, x_test, y_test, verbose = False):
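    # Convert the target to a flat 1-D NumPy array (multi-dimensional indexing of a Series is no longer supported)
    Y_train = np.asarray(Y_train).ravel()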
    model.fit(X_train, Y_train)
    y_predict = model.predict(x_test)
    mse = mean_squared_error(y_test,y_predict)
    r_sq = r2_score(y_test,y_predict)
    print("Mean Squared Error Value : "+"{:.2f}".format(mse))
    print("R-Squared Value : "+"{:.2f}".format(r_sq))
    return model, mse, r_sq

#### Model-1: Ridge Regression

In [None]:
from sklearn import linear_model
ridge_reg = linear_model.Ridge()
print("Ridge Regression")
print("----------------")
model_1, mse_1, r_sq_1 = run_model(ridge_reg, X_train, Y_train, x_test, y_test)

#### Model-2: LGBM Regression

In [None]:
import lightgbm
lgbm_reg = lightgbm.LGBMRegressor()
print("LGBM Regression")
print("---------------")
model_2, mse_2, r_sq_2 = run_model(lgbm_reg, X_train, Y_train, x_test, y_test)

#### Model-3: XGB Regression

In [None]:
import xgboost
xgb_params = {'n_estimators':500, 'max_depth':8}
xgb_reg = xgboost.XGBRegressor(**xgb_params)
print("XGBoost Regression")
print("------------------")
model_3, mse_3, r_sq_3 = run_model(xgb_reg, X_train, Y_train, x_test, y_test)

**Observations :** 

1. very high **MSE** (0.52) and very low \\( R^2 \\) value (only 0.08) for Ridge regression;
2. high **MSE** (0.35) and low \\( R^2 \\) value (0.39) for LGBM regression;
3. medium **MSE** (0.27) and medium \\( R^2 \\) value (0.52) for XGB regression.
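
For reference, both metrics can be computed by hand. The sketch below uses made-up values, and `mse` and `r_squared` are helper names introduced only for this illustration. Since the target here is log1p(price), the MSE is expressed in squared log-price units, while \\( R^2 \\) is the fraction of variance explained.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical log-price targets and predictions
y_true = [2.0, 3.0, 4.0, 5.0]
y_pred = [2.5, 3.0, 3.5, 5.0]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```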


### Improved Approach with Advanced Text Pre-Processing :

In [None]:
# Creating a set combining Train & Test data. Applying Count Vectorizer on combined set will help us to get the list of all possible words.
combined_data = pd.concat([train_data,test_data])

# Specify size of training set
train_size = len(train_data)

# Submission set containing only the test IDs
submission = test_data[['test_id']].copy()  # copy so a price column can be added later without warnings

In [None]:
combined_data.shape

In [None]:
# Taking a random 10% sample of the combined data set for experimentation and resetting the row index
combined_frac = combined_data.sample(frac=0.1).reset_index(drop=True)

In [None]:
combined_frac.shape

The steps we will apply for ***advanced text pre-processing*** are:
1. Removing Punctuation
2. Removing Digits
3. Removing Stopwords
4. Changing to Lower-case words
5. Lemmatization or Stemming
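
The first four steps above can be sketched as a single function. This is a minimal illustration using only the standard library: `clean_text` and `TOY_STOPWORDS` are hypothetical names, and the stopword set is a tiny stand-in for a full English stopword list. Step 5 (stemming or lemmatization) is left out because it needs an NLP library.

```python
import string

# Tiny stand-in stopword set (an assumption for this sketch)
TOY_STOPWORDS = {"the", "a", "an", "and", "is", "in", "with", "for", "of"}

# Translation table that deletes punctuation and digits in one pass
_DELETE = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(text: str) -> str:
    text = text.translate(_DELETE)      # steps 1 and 2: punctuation and digits
    tokens = text.lower().split()       # step 4: lower-case, then split on whitespace
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]  # step 3: stopwords
    return " ".join(tokens)

print(clean_text("Brand NEW iPhone 6, with the charger & 2 cases!"))
```

Lower-casing before the stopword check matters, otherwise capitalised stopwords such as "The" would slip through.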

In [None]:
from string import punctuation
punctuation

In [None]:
# Create a list of punctuation replacements
punctuation_symbols = []
for symbol in punctuation:
    punctuation_symbols.append((symbol, ''))

In [None]:
# Remove Punctuation
import string
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

In [None]:
# Remove Digits
def remove_digits(x):
    x = ''.join([i for i in x if not i.isdigit()])
    return x

In [None]:
# Remove Stopwords
from nltk.corpus import stopwords

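# Requires the NLTK stopword corpus, e.g. via nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set gives fast membership tests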

def remove_stop_words(x):
    x = ' '.join([i for i in x.lower().split(' ') if i not in stop])
    return x

In [None]:
# Change to LowerCase Words
def to_lower(x):
    return x.lower()

In [None]:
# Segregating "category_name" into "category_main", "subcat_1", "subcat_2" like we did before 
def transform_category_name(category_name):
    try:
        main, sub1, sub2= category_name.split('/')
        return main, sub1, sub2
    except:
        return np.nan, np.nan, np.nan

train_data['category_main'], train_data['subcat_1'], train_data['subcat_2'] = zip(*train_data['category_name'].apply(transform_category_name))
cat_train = train_data[['category_main','subcat_1','subcat_2', 'price']]

In [None]:
gc.collect()

#### Item Description Analysis

In [None]:
# Remove Digits, Punctuation, Stopwords, Converting to Lower-case and See the Effect
combined_data.item_description = combined_data.item_description.astype(str)
descr = combined_data[['item_description', 'price']].copy()  # copy avoids SettingWithCopyWarning
descr['count'] = descr['item_description'].apply(lambda x : len(str(x)))
descr['item_description'] = descr['item_description'].apply(remove_digits)
descr['item_description'] = descr['item_description'].apply(remove_punctuation)
descr['item_description'] = descr['item_description'].apply(remove_stop_words)
descr.head(20)

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
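# PorterStemmer.stem works on a single word, so stem each token separately
descr['item_description'] = descr['item_description'].apply(lambda x: ' '.join(porter.stem(w) for w in x.split()))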
descr.tail(20)

In [None]:
# Basic data imputation of missing values
def handle_missing_values(df):
    # Assign the result back instead of relying on inplace=True on a column view
    df['category_name'] = df['category_name'].fillna('missing')
    df['brand_name'] = df['brand_name'].fillna('None')
    df['item_description'] = df['item_description'].fillna('None')

In [None]:
# Converts to Categorical Features 
def to_categorical(df):
    df['brand_name'] = df['brand_name'].astype('category')
    df['category_name'] = df['category_name'].astype('category')
    df['item_condition_id'] = df['item_condition_id'].astype('category')

In [None]:
handle_missing_values(combined_frac)
to_categorical(combined_frac)

In [None]:
handle_missing_values(combined_data)
to_categorical(combined_data)

In [None]:
gc.collect()

In [None]:
# Remove Digits, Punctuation, Stopwords, Converting to Lower-case for combined_frac
combined_frac.item_description = combined_frac.item_description.astype(str)
combined_frac['item_description'] = combined_frac['item_description'].apply(remove_digits)
combined_frac['item_description'] = combined_frac['item_description'].apply(remove_punctuation)
combined_frac['item_description'] = combined_frac['item_description'].apply(remove_stop_words)
combined_frac['item_description'] = combined_frac['item_description'].apply(to_lower)
combined_frac['name'] = combined_frac['name'].apply(remove_digits)
combined_frac['name'] = combined_frac['name'].apply(remove_punctuation)
combined_frac['name'] = combined_frac['name'].apply(remove_stop_words)
combined_frac['name'] = combined_frac['name'].apply(to_lower)
combined_frac.head()

In [None]:
# Remove Digits, Punctuation, Stopwords, Converting to Lower-case for combined_data
combined_data.item_description = combined_data.item_description.astype(str)
combined_data['item_description'] = combined_data['item_description'].apply(remove_digits)
combined_data['item_description'] = combined_data['item_description'].apply(remove_punctuation)
combined_data['item_description'] = combined_data['item_description'].apply(remove_stop_words)
combined_data['item_description'] = combined_data['item_description'].apply(to_lower)
combined_data['name'] = combined_data['name'].apply(remove_digits)
combined_data['name'] = combined_data['name'].apply(remove_punctuation)
combined_data['name'] = combined_data['name'].apply(remove_stop_words)
combined_data['name'] = combined_data['name'].apply(to_lower)
combined_data.head()

In [None]:
gc.collect()

### Applying CountVectorizer / TfidfVectorizer / LabelBinarizer

* CountVectorizer counts word frequencies. 
* TfidfVectorizer weights each count by inverse document frequency, so words that are rare across the corpus get larger weights and very frequent words get smaller weights.
* LabelBinarizer converts labels into one-hot indicator vectors, e.g. "A", "B", "C" -> [1,0,0], [0,1,0], [0,0,1].
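
To make the three transformers concrete, here is a small sketch on a toy corpus (the documents and labels are made up for illustration). It shows raw counts, TF-IDF weights of the same shape, and the indicator matrix produced by `LabelBinarizer`, which has one column per class.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

docs = ["new shoes", "new new bag", "used bag"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)      # sparse matrix of raw term counts
print(sorted(cv.vocabulary_))        # vocabulary learned from the toy corpus
print(counts.toarray())

tv = TfidfVectorizer()
tfidf = tv.fit_transform(docs)       # same shape, counts re-weighted by IDF
print(tfidf.toarray().round(2))

lb = LabelBinarizer()
onehot = lb.fit_transform(["A", "B", "C", "A"])
print(lb.classes_)
print(onehot)                        # one indicator column per class
```

In the first document, "shoes" appears in fewer documents than "new", so its TF-IDF weight is larger even though both occur once.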

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

In [None]:
# Apply Count Vectorizer to "name", this converts it into a sparse matrix 
cv = CountVectorizer(min_df=10)
X_name = cv.fit_transform(combined_data['name'])

In [None]:
# Apply Count Vectorizer to "category_name", this converts it into a sparse matrix
cv = CountVectorizer()
X_category = cv.fit_transform(combined_data['category_name'])

In [None]:
# Apply TF-IDF to "item_description"; this also yields a sparse matrix
tv = TfidfVectorizer(max_features=55000, ngram_range=(1, 2), stop_words='english')
X_description = tv.fit_transform(combined_data['item_description'])

In [None]:
# Apply LabelBinarizer to "brand_name"
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(combined_data['brand_name'])

In [None]:
# vstack - adds rows
# hstack - adds columns
# csr_matrix - handles sparse matrix

from scipy.sparse import vstack, hstack, csr_matrix
X_dummies = csr_matrix(pd.get_dummies(combined_data[['item_condition_id', 'shipping']], sparse=True).values)

In [None]:
# Create the final sparse matrix combining everything together
sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
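
As a toy illustration of the stacking step, the sketch below joins two small sparse blocks column-wise and slices rows from the result. The matrices are made up; only `csr_matrix` and `hstack` from SciPy are used.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two small blocks with the same number of rows
a = csr_matrix(np.array([[1, 0], [0, 0], [0, 2]]))
b = csr_matrix(np.array([[0, 0, 3], [4, 0, 0], [0, 0, 0]]))

merged = hstack((a, b)).tocsr()   # columns are concatenated: 2 + 3 = 5
print(merged.shape)
print(merged.nnz)                 # number of stored non-zero entries
print(merged[:2].toarray())       # CSR supports efficient row slicing
```

Only the non-zero entries are stored, which is why wide text feature matrices stay manageable in memory.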

In [None]:
X_train_sparse = sparse_merge[:train_size]
X_test = sparse_merge[train_size:]

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3, shuffle=True, random_state=12345)
y = np.log1p(train_data['price'])
# Note: each iteration overwrites the previous split, so only the last fold is kept for validation
for train_indices, valid_indices in kf.split(X_train_sparse):
    X_train, y_train = X_train_sparse[train_indices], y.iloc[train_indices]
    X_valid, y_valid = X_train_sparse[valid_indices], y.iloc[valid_indices]

In [None]:
gc.collect()

In [None]:
def run_model_advText(model, X_train, y_train, X_valid, y_valid, verbose = False):
    model.fit(X_train, y_train)
    preds_valid = model.predict(X_valid)
    mse = mean_squared_error(y_valid,preds_valid)
    r_sq = r2_score(y_valid,preds_valid)
    print("Mean Squared Error Value : "+"{:.2f}".format(mse))
    print("R-Squared Value : "+"{:.2f}".format(r_sq))
    return model, mse, r_sq

In [None]:
ridge_reg = linear_model.Ridge(solver = "saga", fit_intercept=False)
print("Ridge Regression (After advanced Text Pre-processing)")
print("-----------------------------------------------------")
model_11, mse_11, r_sq_11 = run_model_advText(ridge_reg, X_train, y_train, X_valid, y_valid)

In [None]:
lgbm_reg = lightgbm.LGBMRegressor()
print("LGBM Regression (After advanced Text Pre-processing)")
print("----------------------------------------------------")
model_22, mse_22, r_sq_22 = run_model_advText(lgbm_reg, X_train, y_train, X_valid, y_valid)

In [None]:
xgb_params = {'n_estimators':500, 'max_depth':8}
xgb_reg = xgboost.XGBRegressor(**xgb_params)
print("XGB Regression (After advanced Text Pre-processing)")
print("---------------------------------------------------")
model_33, mse_33, r_sq_33 = run_model_advText(xgb_reg, X_train, y_train, X_valid, y_valid)

### Improvement in Model Performance after using Advanced Text Pre-Processing :
* **MSE** has now decreased for all models
* \\( R^2 \\) value has now increased for all models

The Ridge model shows the largest improvement, so we will use **Ridge** for predictions on the test dataset.

In [None]:
mse_before = [mse_1, mse_2, mse_3]
r_sq_before = [r_sq_1, r_sq_2, r_sq_3]
mse_after = [mse_11, mse_22, mse_33]
r_sq_after = [r_sq_11, r_sq_22, r_sq_33]
model_data = {'Model': ['Ridge','LGBM','XGB'],
              'MSE_without_Advanced_TextProcessing': mse_before,
              'R_Square_without_Advanced_TextProcessing': r_sq_before,
              'MSE_with_Advanced_TextProcessing': mse_after,
              'R_Square_with_Advanced_TextProcessing': r_sq_after}
data_compare = pd.DataFrame(model_data)

# Fresh figure with two side-by-side axes
fig, ax = plt.subplots(1, 2, figsize=(13,4))
ax[0]=data_compare.plot(kind='line', x='Model', y='MSE_without_Advanced_TextProcessing', color='DarkBlue', linewidth=0.7, marker='o', markersize=6, ax=ax[0])
ax[0]=data_compare.plot(kind='line', x='Model', y='MSE_with_Advanced_TextProcessing', secondary_y=False,color='Red', linewidth=0.7, marker='o', markersize=6, ax=ax[0])
ax[1]=data_compare.plot(kind='line', x='Model', y='R_Square_without_Advanced_TextProcessing', color='DarkBlue', linewidth=0.7, marker='^', markersize=6, ax=ax[1])
ax[1]=data_compare.plot(kind='line', x='Model', y='R_Square_with_Advanced_TextProcessing', secondary_y=False,color='Red', linewidth=0.7, marker='^', markersize=6, ax=ax[1])

ax[0].set_title("Improvement in MSE after Advanced Text Processing", fontsize = 13)
ax[0].set_ylabel("Mean Square Error")
ax[1].set_title("Improvement in R-Square after Advanced Text Processing", fontsize = 13)
ax[1].set_ylabel("R-Square Value")
plt.tight_layout()
plt.show()

### Topic Modelling and LDA :
A topic model examines a set of documents (a text corpus) and discovers the important topics based on the statistics of the words in each document. The "topics" produced by topic modeling techniques are clusters of similar words ([Wikipedia](http://en.wikipedia.org/wiki/Topic_model)).  

Topic Modelling is an **unsupervised concept**. Topic models are also referred to as "probabilistic topic models", which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the current age, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies ([Wiki](http://en.wikipedia.org/wiki/Topic_model)). 

LDA (Latent Dirichlet Allocation) is a generative statistical model. In the initialization stage, each word is assigned to a random topic. Iteratively, the algorithm goes through each word and reassigns the word to a topic taking into consideration what is the probability of a new word belonging to a topic and what is the probability of the document (or text corpus) to be generated by a topic ([NLP for Hackers](https://nlpforhackers.io/topic-modeling/)).
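
A minimal sketch of LDA on a toy corpus follows (the four documents are made up). Note that scikit-learn's `LatentDirichletAllocation` fits the model with variational Bayes rather than the word-by-word reassignment described above, but the outputs have the same meaning: a topic distribution per document and a word weighting per topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "leather bag purse wallet",
    "purse wallet leather strap",
    "phone charger cable case",
    "charger cable phone screen",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_docs, n_topics)

# Each row is a probability distribution over the topics
print(doc_topics.round(2))
# components_ holds unnormalised topic-word weights, shape (n_topics, n_words)
print(lda.components_.shape)
```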

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(max_features=20000,stop_words='english',lowercase=True)

# Fit it to dataset fraction
cvz = cvectorizer.fit_transform(combined_frac['item_description'])

# Initialize LDA Model with 10 Topics
lda_model = LatentDirichletAllocation(n_components=10,random_state=12345)

# Fit it to CountVectorizer Transformation
X_topics = lda_model.fit_transform(cvz)

# Define variables
n_top_words = 10
topic_summaries = []

# Get the topic words
topic_word = lda_model.components_

# Get the vocabulary from the text features
vocab = cvectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Display the Topic Models
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

### Visualizing LDA Topics :
Enter a topic number from 1 to 10 in the "Selected Topic" box to view the 30 most relevant terms for that topic on the right-hand side. In the stacked bar chart on the right, the sky-blue bars show the overall term frequency and the red bars show the estimated term frequency within the selected topic.

In [None]:
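# Note: pyLDAvis 3.4 renamed this module to pyLDAvis.lda_model; adjust the import and the call below on newer versions
import pyLDAvis.sklearn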
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, cvz, cvectorizer, mds='tsne')
panel

### Apply Final Ridge Model on Test Dataset and Submission :

In [None]:
predictions = ridge_reg.predict(X_test)
submission["price"] = np.expm1(predictions)
submission.head()

In [None]:
submission.to_csv("submission.csv", index = False)