# Mercari Price Suggestion Challenge - Part 1 - Exploratory Analysis

This notebook walks through the steps I took to explore the Mercari price suggestion challenge data. The objective is to uncover preliminary insights that could be strong indicators of product prices. The item names and text descriptions are not analysed here; the focus is on whether the other features have a significant impact on prices.

In [None]:
# Importing necessary packages
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os

import re

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error 

In [None]:
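# Read Training Data
# 'No description yet' is treated as a missing description, as is done for the test data later on
trainData = pd.read_csv('../input/train.tsv',sep='\t',na_values={'brand_name':'NaN','item_description':'No description yet'})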


## Loading Dataset and Brief Exploration

In [None]:
# Looking up for basic information of the dataset
display(trainData.dtypes)
trainData.info()  # info() prints its summary itself and returns None

# Viewing the first 10 rows of the dataset
display(trainData.head(10))

# Looking at distribution of price
display(trainData['price'].describe())
sns.distplot(trainData['price'])
plt.title('Histogram of prices')
plt.show()
plt.close()

Some columns have missing values. The number of missing values is summarized as follows:

|    Column Name   | Number of Missing Values |
|:----------------:|:------------------------:|
|   category_name  |           6,327          |
|    brand_name    |          632,682         |
| item_description |          82,493          |
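
Counts like these can be obtained with `isnull().sum()`. The sketch below uses a small made-up TSV (not the Mercari data) and assumes that the placeholder text 'No description yet' should count as a missing description:

```python
import io
import pandas as pd

# Toy TSV standing in for train.tsv (values are made up for illustration)
toy_tsv = (
    "name\tbrand_name\titem_description\n"
    "Shirt\tNike\tBarely worn\n"
    "Mug\t\tNo description yet\n"
    "Lamp\t\tWorks fine\n"
)

toy = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    # Treat the placeholder text as missing, on top of pandas' default NA markers
    na_values={"item_description": ["No description yet"]},
)

# Number of missing values per column, keeping only columns that have any
missing_counts = toy.isnull().sum()
print(missing_counts[missing_counts > 0])
```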

The next section talks about some data pre-processing steps which can be applied before starting the exploratory analysis.

The training dataset consists of around 1.4 million rows with 8 columns. 

We get the following information from looking at the first few rows of the dataset:

1. The 'item_condition_id' and 'shipping' columns are represented as numbers, but are in fact categorical. Thus, these should be converted into categorical variables.

2. The 'category_name' column seems to consist of three different labels, separated by a '/', most probably representing a root category and further sub-categories. It would be interesting to split this column into three different columns and study the effect of each column on the price.

3. Some items have missing brand names and item descriptions. It would be worthwhile to look at the effect of the absence of brand names and item description on the item price.

4. The distribution of price is heavily right-skewed: the minimum is 0, the maximum is 2000, and the mean and median are 26.7 and 17 respectively. It is therefore better to work with the log of prices, whose distribution is closer to normal. A price of 0 also does not make sense for an online marketplace and may add noise to the model. Let's take a look at the number of rows with zero prices.



In [None]:
display(sum(trainData['price']==0))

display(trainData[trainData['price']==0].head())

We see that there are 874 rows with price 0, which is a tiny proportion (about 0.06%) of the full dataset. Thus, it is safe to remove these rows.

## Data Pre-Processing

This section defines a function to pre-process our dataset based on the insights obtained from the previous section. The following steps were performed:

1. Splitting the 'category_name' column into three different columns, namely, 'category1', 'category2', 'category3'

2. Converting 'category1', 'category2', 'category3', 'item_condition_id', 'shipping' columns into categorical data type

3. Creating two categorical columns 'brand_name_present' and 'item_description_present' to denote whether the brand name and item description are present for an item

Additionally, another column 'log_price', containing the natural log of (price + 1), was created so that the distribution of prices is closer to normal. Rows with a price of 0 are dropped from the training data in the same step.

In [None]:
# Define a function to perform pre-processing
def dataPreprocess(input_data, train = True):
    data = input_data.copy()


    ## Creating individual categories from category_name
    categoryNames = data['category_name'].str.split('/',expand = True)

    data['category1'] = categoryNames[0]
    data['category2'] = categoryNames[1]
    data['category3'] = categoryNames[2]
        
    ## Converting item_condition_id, shipping and category_name to categorical variables
    data['shipping'] = pd.Categorical(['Free' if x==1 else 'Paid' for x in data['shipping']])
    data['item_condition_id'] = pd.Categorical(data['item_condition_id'])
    data['category1'] = pd.Categorical(['No category name present' if x!=x else x for x in data['category1']])
    data['category2'] = pd.Categorical(['No category name present' if x!=x else x for x in data['category2']])
    data['category3'] = pd.Categorical(['No category name present' if x!=x else x for x in data['category3']])
    data['brand_name_present'] = pd.Categorical(['Yes' if x==False else 'No' for x in data['brand_name'].isnull()])
    data['item_description_present'] = pd.Categorical(['Yes' if x==False else 'No' for x in data['item_description'].isnull()])
    
    ## For training data, drop zero-price rows and store the natural log of (price + 1)
    if train:
        ## .copy() avoids assigning a new column to a filtered view of the data
        data = data[data['price']!=0].copy()
        data['log_price'] = np.log1p(data['price'])
    return data
    
    


In [None]:
trainDataProcessed = dataPreprocess(trainData)
display(trainDataProcessed.head())
display(trainDataProcessed.dtypes)
trainDataProcessed.info()  # info() prints its summary itself and returns None
display(sum(trainDataProcessed['price']==0))

The pre-processing function seems to have worked correctly, as can be seen from the first 5 rows of the dataset. The next section covers exploratory analysis on this pre-processed dataset.
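
One detail of the category split is worth illustrating. If a category name contains more than two '/' separators, `str.split('/', expand=True)` produces more than three columns and only the first three are kept. A minimal sketch on made-up rows, assuming it is preferable to keep everything after the second separator inside the third level (the `n=2` argument is my addition, not part of the function above):

```python
import pandas as pd

# Made-up category names: one with three levels, one with extra separators, one missing
toy = pd.DataFrame({
    "category_name": [
        "Women/Tops & Blouses/Blouse",
        "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
        None,
    ]
})

# n=2 caps the split at three parts, so any extra '/' stays inside the third level
levels = toy["category_name"].str.split("/", n=2, expand=True)
levels.columns = ["category1", "category2", "category3"]

# Same placeholder label as used in dataPreprocess
levels = levels.fillna("No category name present")
print(levels)
```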

## Exploratory Analysis

### Category Analysis
First, let's check the number of unique categories at each category level (1/2/3).

In [None]:
print(set(trainDataProcessed['category1']))
print('Number of unique Category 1 Labels: '+str(len(set(trainDataProcessed['category1'])))) # Accounting for missing values
print('Number of unique Category 2 Labels: '+str(len(set(trainDataProcessed['category2'])))) # Accounting for missing values
print('Number of unique Category 3 Labels: '+str(len(set(trainDataProcessed['category3'])))) # Accounting for missing values


We see that there are 10 major categories, which are divided into 113 sub-categories, which are in turn broken down into 870 finer sub-categories. Some product listings have no category name. Let's take a look at the number of such rows before examining whether the categories have an effect on prices:

In [None]:
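# Missing categories were replaced by a placeholder label during pre-processing,
# so count that label instead of calling isna()
missingLabel = 'No category name present'
print('Category 1 Missing Values: ',(trainDataProcessed['category1']==missingLabel).sum())
print('Category 2 Missing Values: ',(trainDataProcessed['category2']==missingLabel).sum())
print('Category 3 Missing Values: ',(trainDataProcessed['category3']==missingLabel).sum())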

The number of rows missing a category label is very small (~0.4 %) and for now, I will let these rows remain in the dataset.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax = sns.boxplot(x='category1',y='log_price', data=trainDataProcessed)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90, fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.title('Boxplot of Log Prices wrt Category 1',fontsize=26)
plt.show()
plt.close()

fig, ax = plt.subplots(figsize=(50, 20))
ax = sns.boxplot(x='category2',y='log_price', data=trainDataProcessed)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90, fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.title('Boxplot of Log Prices wrt Category 2',fontsize=26)
plt.suptitle('')
plt.show()
plt.close()

fig, ax = plt.subplots(figsize=(50, 20))
ax = sns.boxplot(x='category3',y='log_price', data=trainDataProcessed)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90, fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.title('Boxplot of Log Prices wrt Category 3',fontsize=26)
plt.suptitle('')
plt.show()
plt.close()

From the plots, the following insights can be drawn:

1. Prices differ noticeably across category 1 labels.

2. Category 2 looks even more promising in terms of price differences across labels. However, some similar categories appear under two different labels, so further data processing would be needed to combine them.

3. Prices differ across category 3 labels as well, but the individual labels cannot be distinguished in the plot. It might be worthwhile to investigate how to group these labels to get meaningful information from category 3.
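
When a box plot has too many labels to read, a grouped summary table can complement it. The following is a minimal sketch on made-up prices (not the Mercari data); on the real data the same `groupby` could be applied to `trainDataProcessed` with 'category2' or 'category3' as the key:

```python
import numpy as np
import pandas as pd

# Made-up prices for three top-level categories
toy = pd.DataFrame({
    "category1": ["Men", "Men", "Women", "Women", "Electronics", "Electronics"],
    "price": [10.0, 14.0, 20.0, 30.0, 50.0, 150.0],
})
toy["log_price"] = np.log1p(toy["price"])

# Count and median log price per label, highest median first
summary = (
    toy.groupby("category1")["log_price"]
       .agg(["count", "median"])
       .sort_values("median", ascending=False)
)
print(summary)
```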

### Effect of shipping
Next, we look at whether free shipping has an effect on prices.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax=sns.boxplot(x='shipping',y='log_price', data=trainDataProcessed)
plt.title('Effect of free shipping on prices',fontsize=26)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

Items with free shipping seem to have a lower price than items with paid shipping. This might imply that sellers are more willing to cover the shipping cost for smaller and lighter items than for heavier or more expensive ones. Thus, it may be worthwhile to look at the combined effect of category and shipping on prices.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax=sns.boxplot(x='category1',y='log_price',hue='shipping', data=trainDataProcessed)
plt.title('Effect of free shipping on prices across different categories',fontsize=26)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

We see that the difference in price between paid and free shipping varies across different categories. For example, the electronics and home categories have a significantly higher difference between paid and free shipping prices compared to categories like 'Men' and 'Women'.
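
The size of this gap can also be put into numbers with a pivot table of median log prices. This is a sketch on made-up values; the column name `paid_minus_free` is only illustrative:

```python
import pandas as pd

# Made-up log prices for two categories and both shipping options
toy = pd.DataFrame({
    "category1": ["Electronics"] * 4 + ["Women"] * 4,
    "shipping": ["Free", "Free", "Paid", "Paid"] * 2,
    "log_price": [2.0, 2.4, 3.4, 3.8, 2.6, 2.8, 2.9, 3.1],
})

# Median log price per category (rows) and shipping option (columns)
medians = toy.pivot_table(index="category1", columns="shipping",
                          values="log_price", aggfunc="median")

# Gap between paid and free shipping within each category
medians["paid_minus_free"] = medians["Paid"] - medians["Free"]
print(medians)
```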

### Effect of item condition
Next, the effect of item condition on prices is explored.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax=sns.boxplot(x='item_condition_id',y='log_price', data=trainDataProcessed)
plt.title('Effect of item condition  on prices',fontsize=26)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

The box plots show no clear difference in prices across item conditions. Next, let's investigate whether the presence of a brand name or an item description has an effect on prices.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax=sns.boxplot(x='brand_name_present',y='log_price', data=trainDataProcessed)
plt.title('Effect of presence of brand names  on prices',fontsize=26)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
ax=sns.boxplot(x='item_description_present',y='log_price', data=trainDataProcessed)
plt.title('Effect of presence of item description  on prices',fontsize=26)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

Items with brand names have a slightly higher price than those without a brand name.

However, there seems to be only a very small difference in prices between items with a description and items without one.

Another interesting thing to explore is the effect of description length on prices. For this, we create another column containing the word count of the item description and draw a scatter plot of prices against the word count.

In [None]:
def word_count(string):
    try:
        return len(re.findall("[a-zA-Z_]+", string))
    except TypeError:
        return 0

In [None]:
trainDataProcessed['word_count'] = [word_count(x) for x in trainDataProcessed['item_description']]
trainDataProcessed.head()



In [None]:
sns.scatterplot(x='word_count',y='log_price',data=trainDataProcessed)

No apparent relation was found between item description length and the price.
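
A correlation coefficient can back up what the scatter plot suggests. The sketch below uses made-up descriptions and prices, and computes both the Pearson correlation and a rank correlation (Pearson on ranks); the `isinstance` check is a variant of the `try/except` used in `word_count` above:

```python
import re
import pandas as pd

def word_count(text):
    # Non-string input (e.g. NaN for a missing description) counts as zero words
    if not isinstance(text, str):
        return 0
    return len(re.findall(r"[a-zA-Z_]+", text))

# Made-up descriptions and log prices
toy = pd.DataFrame({
    "item_description": ["Brand new", "Used twice, still in the box", None, "Great"],
    "log_price": [2.5, 3.1, 2.0, 2.8],
})
toy["word_count"] = toy["item_description"].apply(word_count)

# Linear (Pearson) correlation between word count and log price
pearson = toy["word_count"].corr(toy["log_price"])

# Rank correlation: Pearson correlation of the ranks, robust to skewed counts
rank_corr = toy["word_count"].rank().corr(toy["log_price"].rank())
print(pearson, rank_corr)
```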

Thus, the only columns that seem to have some predictive power for prices are the category labels and free shipping. In order to make the model more accurate, we will need to extract features from the 'name' and 'item_description' columns.

## Basic Models

Before I go into building a more complex model, taking the item name and item description into account, I want to see how well a model can do with category and shipping columns as features to predict prices.

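Since RMSLE (Root Mean Squared Logarithmic Error) is being used as the final metric, I will be training the model on log(price + 1). The ordinary root mean squared error on this log scale is then exactly the RMSLE on the price scale.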
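
The link between the log-scale error and RMSLE can be checked numerically. The sketch below uses made-up prices and made-up log-scale predictions; `rmsle` is a helper written for this illustration. It also shows that `np.expm1` is the inverse of `np.log1p`, which is what is needed to turn log-scale predictions back into prices:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original price scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

prices = np.array([10.0, 25.0, 100.0])               # made-up actual prices
log_true = np.log1p(prices)                          # what the model is trained on
log_pred = log_true + np.array([0.1, -0.2, 0.05])    # made-up log-scale predictions

# Plain RMSE on the log scale
rmse_log = np.sqrt(np.mean((log_pred - log_true) ** 2))

# Map predictions back to prices with expm1 (inverse of log1p), then score with RMSLE
price_pred = np.expm1(log_pred)
print(rmse_log, rmsle(prices, price_pred))
```

Both printed numbers agree, so minimising squared error on `log_price` targets the competition metric directly.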

### Model 1 - Linear Regression Model with Shipping, Category 1, Category 2 and Brand Name Presence as features

In [None]:
xData = trainDataProcessed[['category1','category2','shipping','brand_name_present']]

xOneHotEncoded = pd.get_dummies(xData,
                                  columns = ['category1','category2','shipping','brand_name_present'],
                                  prefix= ['cat1','cat2','shipping','brand'])
yData = trainDataProcessed['log_price']

display(xData.head())
display(xOneHotEncoded.shape)
display(yData.head())

In [None]:
x_train, x_test, y_train, y_test = train_test_split(xOneHotEncoded,yData,test_size=0.2, random_state = 1)

In [None]:
modelRegression = LinearRegression()
modelRegression.fit(x_train,y_train)

### Model 2 - Linear Regression Model with Shipping, Category 1, Category 2, Brand Name Presence, and Min, Max and Median Prices for each combination of Category 1 and Category 2 as features

For this model, the following features were added:

1. The minimum price for each pair of category 1 and category 2
2. The maximum price for each pair of category 1 and category 2
3. The median price for each pair of category 1 and category 2

The idea behind this was to add features which might help the model to learn from the price distribution for items belonging to a particular category and sub-category.

In [None]:
minPrice = trainDataProcessed.groupby(['category1','category2'], as_index=False)['price'].min()
minPrice['price'] = np.log(minPrice['price']+1)
minPrice

maxPrice = trainDataProcessed.groupby(['category1','category2'], as_index=False)['price'].max()
maxPrice['price'] = np.log(maxPrice['price']+1)
maxPrice

medPrice = trainDataProcessed.groupby(['category1','category2'], as_index=False)['price'].median()
medPrice['price'] = np.log(medPrice['price']+1)
medPrice

trainDataModel2 = pd.merge(trainDataProcessed,minPrice,on=['category1','category2'],suffixes=('','_min'))
trainDataModel2 = pd.merge(trainDataModel2,maxPrice,on=['category1','category2'],suffixes=('','_max'))
trainDataModel2 = pd.merge(trainDataModel2,medPrice,on=['category1','category2'],suffixes=('','_med'))
trainDataModel2
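
As an alternative to the merges, the same group statistics can be attached with `groupby(...).transform(...)`, which returns one value per original row and keeps the original index and row order. This is a sketch on made-up data, not the approach used in the cell above:

```python
import numpy as np
import pandas as pd

# Made-up rows with a non-contiguous index, as left behind after dropping rows
toy = pd.DataFrame({
    "category1": ["Men", "Women", "Men", "Women", "Men"],
    "category2": ["Shoes", "Tops", "Shoes", "Tops", "Tops"],
    "price": [30.0, 12.0, 50.0, 20.0, 8.0],
}, index=[3, 7, 8, 11, 15])

grouped = toy.groupby(["category1", "category2"])["price"]

# transform broadcasts each group statistic back to the rows of that group
toy["price_min"] = np.log1p(grouped.transform("min"))
toy["price_max"] = np.log1p(grouped.transform("max"))
toy["price_med"] = np.log1p(grouped.transform("median"))
print(toy)
```

Because the frame is modified in place, the feature columns and the target column stay on the same rows without any further bookkeeping.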

In [None]:
xData2 = trainDataModel2[['category1','category2','price_min','price_max','price_med','shipping','brand_name_present']]

xOneHotEncoded2 = pd.get_dummies(xData2,
                                  columns = ['category1','category2','shipping','brand_name_present'],
                                  prefix= ['cat1','cat2','shipping','brand'])
yData2 = trainDataModel2['log_price']  # target taken from the merged frame so that rows line up with xData2

display(xData2.head())
display(xOneHotEncoded2.shape)
display(yData2.head())

In [None]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(xOneHotEncoded2,yData2,test_size=0.2, random_state = 1)

In [None]:
modelRegression2 = LinearRegression()
modelRegression2.fit(x_train2,y_train2)

### Model 3 - Gradient Boost Regression with Shipping, Category 1, Category 2 and Brand Name Presence as features

The third model was built on the same features as the first model, but this time I used a gradient boosting regression model to predict the prices. I used the default hyper-parameter settings so that its performance can be compared against the first two models.

In [None]:
xData3 = trainDataProcessed[['category1','category2','shipping','brand_name_present']]

xOneHotEncoded3 = pd.get_dummies(xData3,
                                  columns = ['category1','category2','shipping','brand_name_present'],
                                  prefix= ['cat1','cat2','shipping','brand'])
yData3 = trainDataProcessed['log_price']

x_train3, x_test3, y_train3, y_test3 = train_test_split(xOneHotEncoded3,yData3,test_size=0.2, random_state = 1)

modelRegression3 = GradientBoostingRegressor()
modelRegression3.fit(x_train3,y_train3)

## Analysis of results

Let's first compare the scores on the training and validation sets for the three models. Because the targets are log(price + 1), the mean squared error on them is the mean squared log error (MSLE) of the prices, and its square root is the RMSLE.

In [None]:
def report_scores(label, model, x, y):
    # Targets and predictions are both log(price + 1), so this MSE is the MSLE of the prices
    msle = mean_squared_error(y, model.predict(x))
    print(label, '- MSLE:', round(msle, 5), '| RMSLE:', round(np.sqrt(msle), 5))

In [None]:
report_scores('Model 1 on Training Set', modelRegression, x_train, y_train)
report_scores('Model 1 on Validation Set', modelRegression, x_test, y_test)

report_scores('Model 2 on Training Set', modelRegression2, x_train2, y_train2)
report_scores('Model 2 on Validation Set', modelRegression2, x_test2, y_test2)

report_scores('Model 3 on Training Set', modelRegression3, x_train3, y_train3)
report_scores('Model 3 on Validation Set', modelRegression3, x_test3, y_test3)

The results from the recorded run are tabulated below. The RMSLE columns are the square roots of the MSLE columns:

|                       Model                      | Training Set MSLE | Validation Set MSLE | Training Set RMSLE | Validation Set RMSLE |
|:------------------------------------------------:|:-----------------:|:-------------------:|:------------------:|:--------------------:|
|            Model 1 - Linear Regression           |      0.45774      |       0.45769       |       0.6766       |        0.6765        |
| Model 2 - Linear Regression with  extra features |      0.55696      |       0.55749       |       0.7463       |        0.7467        |
|             Model 3 - GBM Regression             |      0.44897      |       0.44932       |       0.6701       |        0.6703        |

From the results, it can be seen that Models 1 and 3 do considerably better than Model 2. The Model 2 figures should be read with caution, however. Its features are a superset of Model 1's, so an ordinary least squares fit on the same rows cannot have a higher training error. A gap like this indicates that features and targets were not row-aligned when the figures were recorded (pd.merge returns a new index and may reorder rows), so the table does not show whether the extra features help. Plotting the distribution of the predicted results and visualizing it against the actual prices may help us evaluate our model performance in a better way.

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_train, label = 'Actual')
sns.distplot(modelRegression.predict(x_train), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 1 on Training Set', fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_test, label = 'Actual')
sns.distplot(modelRegression.predict(x_test), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 1 on Validation Set', fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_train2, label = 'Actual')
sns.distplot(modelRegression2.predict(x_train2), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 2 on Training Set',fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_test2, label = 'Actual')
sns.distplot(modelRegression2.predict(x_test2), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 2 on Validation Set',fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

In [None]:
fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_train3, label = 'Actual')
sns.distplot(modelRegression3.predict(x_train3), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 3 on Training Set',fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

fig, ax = plt.subplots(figsize=(50, 20))
sns.distplot(y_test3, label = 'Actual')
sns.distplot(modelRegression3.predict(x_test3), label = 'Predicted')
plt.title('Histogram of Actual vs Predicted Prices for Model 3 on Validation Set',fontsize = 26)
plt.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='22')
plt.setp(ax.get_legend().get_title(), fontsize='32')
ax.set_xticklabels(ax.get_xticklabels(), fontsize=20)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=24)
ax.set_ylabel(ax.get_ylabel(), fontsize=24)
plt.show()
plt.close()

From the histograms, the following observations can be made:

1. All the models fail to capture the tails of the price distribution.
2. Model 2, in particular, predicts values in a very narrow range of log prices, which is why it performs poorly compared with the other two models.

## Predicting on final test data

In [None]:
testData = pd.read_csv('../input/test_stg2.tsv',sep='\t',na_values={'brand_name':'NaN','item_description':'No description yet'})


In [None]:
testDataProcessed = dataPreprocess(testData, train = False)

In [None]:
testDataProcessed = pd.merge(testDataProcessed,minPrice,on=['category1','category2'],suffixes=('','_min'))
testDataProcessed = pd.merge(testDataProcessed,maxPrice,on=['category1','category2'],suffixes=('','_max'))
testDataProcessed = pd.merge(testDataProcessed,medPrice,on=['category1','category2'],suffixes=('','_med'))

testDataProcessed.rename(columns={'price':'price_min'},inplace=True)
display(testDataProcessed.head())
## Checking whether categories in the train and test set are the same or not

print('Category 1 is same for train and test sets?: ',set(testDataProcessed['category1'])==set(trainDataProcessed['category1']))
print('Category 2 is same for train and test sets?: ',set(testDataProcessed['category2'])==set(trainDataProcessed['category2']))
print('Category 3 is same for train and test sets?: ',set(testDataProcessed['category3'])==set(trainDataProcessed['category3']))
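
If the label sets differ, `pd.get_dummies` produces different columns for the train and test sets, and a model fitted on one cannot be applied to the other as is. A minimal sketch on made-up frames, assuming that columns for unseen test labels can simply be dropped and that columns missing from the test set should be filled with zeros:

```python
import pandas as pd

# Made-up frames: 'Home' appears only in test, 'Men' and 'Kids' only in train
train_x = pd.DataFrame({"category1": ["Men", "Women", "Kids"],
                        "shipping": ["Free", "Paid", "Free"]})
test_x = pd.DataFrame({"category1": ["Women", "Home"],
                       "shipping": ["Paid", "Paid"]})

train_encoded = pd.get_dummies(train_x, columns=["category1", "shipping"],
                               prefix=["cat1", "shipping"])
test_encoded = pd.get_dummies(test_x, columns=["category1", "shipping"],
                              prefix=["cat1", "shipping"])

# Keep exactly the training columns, in training order: labels unseen in
# training are dropped, labels absent from the test set are filled with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```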

### Model 1 Results

In [None]:
testXData = testDataProcessed[['category1','category2','shipping','brand_name_present']]
xTestDataHotEncoded = pd.get_dummies(testXData,
                                  columns = ['category1','category2','shipping','brand_name_present'],
                                  prefix= ['cat1','cat2','shipping','brand'])
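# Align the test columns with the columns the model was trained on
xTestDataHotEncoded = xTestDataHotEncoded.reindex(columns=x_train.columns, fill_value=0)
predictedPrices = modelRegression.predict(xTestDataHotEncoded)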

In [None]:
predictedPrices = np.expm1(predictedPrices)  # inverse of log(price + 1) is exp(x) - 1

In [None]:
submission1 = pd.DataFrame({'test_id':testDataProcessed['test_id'],
                          'price':predictedPrices})
submission1.to_csv('submission1.csv',index=False)

This submission leads to a stage 2 test score of 0.68914. The goal would be to further improve upon this score.

### Model 2 Results

In [None]:
testXData2 = testDataProcessed[['category1','category2','price_min','price_max','price_med','shipping','brand_name_present']]
xTestDataHotEncoded2 = pd.get_dummies(testXData2,
                                  columns = ['category1','category2','shipping','brand_name_present'],
                                  prefix= ['cat1','cat2','shipping','brand'])

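# Align the test columns with the columns the model was trained on
xTestDataHotEncoded2 = xTestDataHotEncoded2.reindex(columns=x_train2.columns, fill_value=0)
predictedPrices2 = modelRegression2.predict(xTestDataHotEncoded2)
predictedPrices2 = np.expm1(predictedPrices2)  # inverse of log(price + 1) is exp(x) - 1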

submission2 = pd.DataFrame({'test_id':testDataProcessed['test_id'],
                          'price':predictedPrices2})
submission2.to_csv('submission2.csv',index=False)

This submission led to a score of 0.75651.

### Model 3 Results

In [None]:
predictedPrices3 = modelRegression3.predict(xTestDataHotEncoded)
predictedPrices3 = np.expm1(predictedPrices3)  # inverse of log(price + 1) is exp(x) - 1
submission3 = pd.DataFrame({'test_id':testDataProcessed['test_id'],
                          'price':predictedPrices3})
submission3.to_csv('submission3.csv',index=False)
submission3.head()

This model led to a score of 0.68450.

Possible avenues of improvement are:

1. Extracting features from the item name and item description
2. Using regression algorithms other than linear regression
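
As a rough illustration of the first avenue, text features can be extracted with TF-IDF and fed to a regularised linear model. This is only a sketch under my own assumptions: the item names and prices are made up, and `TfidfVectorizer` with `Ridge` is one possible choice, not something evaluated in this notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Made-up item names and prices, for illustration only
names = [
    "nike running shoes size 10",
    "vintage coffee mug",
    "apple iphone case",
    "nike hoodie mens large",
    "ceramic mug set of four",
    "iphone charger cable",
]
prices = np.array([45.0, 8.0, 12.0, 30.0, 15.0, 7.0])

# TF-IDF turns each name into a sparse vector of weighted word counts;
# Ridge is fitted on log(price + 1), matching the target used above
text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
text_model.fit(names, np.log1p(prices))

# Predict on the log scale, then map back to a price with expm1
predicted = np.expm1(text_model.predict(["nike running shoes"]))
print(predicted)
```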