# Data Analysis for Software Engineers

## Practical Assignment 4
## Getting Ready For Competition

<hr/>
**General Information**

**Due date:** 29 April 2018, 23:59 <br/>
**Competition deadline date:** 30 May 2018, 23:59 <br/>
**Competition link:** [here](https://www.kaggle.com/t/6d3fc375fd254010a1e781f91d6f6fc9)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,8)

# Load datasets

Load the datasets. Be prepared: the files are large.

Feature descriptions are available on the competition web page.

In [None]:
df_train = pd.read_csv('train_kaggle.csv.gz', sep=';', compression='gzip', encoding='utf8')

In [None]:
df_test = pd.read_csv('test_kaggle.csv.gz', sep=';', compression='gzip',  encoding='utf8')

# Prepare dataset

## Target features transformation (1 point)

Look at the distribution of the target feature.
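A minimal sketch of one way to inspect a distribution numerically, before or alongside a histogram. The data here is synthetic (a log-normal sample standing in for the real `price` column), so the numbers say nothing about the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed stand-in for the real target column (assumption).
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=7.0, sigma=1.5, size=10_000), name='price')

# Upper quantiles far above the median and a large positive skewness
# both hint at a heavy right tail.
summary = price.describe(percentiles=[0.5, 0.9, 0.99])
skewness = price.skew()

print(summary)
print('skewness:', skewness)
```

On the real data, the same calls could be applied to `df_train['price']`, and `price.hist(bins=100)` would give the visual counterpart.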

In [None]:
# Your Code Here

One might notice that it is heavy-tailed. In such cases, transforming the target usually improves regression results.

Consider various transformations, such as `np.log(x+1)`, `np.sqrt(x)`, etc.

Which of them gives better results? Apply that transformation.

Don't forget to apply the inverse transformation when preparing the submission file.
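As an illustration of a transformation and its inverse, here is a small sketch on toy values (not the competition data). `np.log1p` computes `log(1 + x)` and `np.expm1` undoes it; `np.sqrt` is undone by squaring. Which one works better for the model is still something to check experimentally.

```python
import numpy as np

# Toy prices, including zero, which plain np.log could not handle.
y_raw = np.array([0.0, 10.0, 250.0, 1_000_000.0])

# Forward and inverse for the log transformation.
y_log = np.log1p(y_raw)        # log(1 + y)
y_log_back = np.expm1(y_log)   # exp(z) - 1

# Forward and inverse for the square-root transformation.
y_sqrt = np.sqrt(y_raw)
y_sqrt_back = np.square(y_sqrt)

print(y_log)
print(y_sqrt)
```

Whatever transformation is fitted on, the same inverse has to be applied to the model predictions before they are written to the submission file.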

In [None]:
# Your Code Here

## Raw Feature Preparation (1 point)

Our baseline model will use the features `subcategory` and `description`.

First of all, we need to polish them slightly.

### Subcategory

Is there any difference between the unique subcategory ids in train and test? Show it.
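One possible way to compare the two sets of ids is with Python set differences. The sketch below uses two toy frames (`toy_train`, `toy_test` are hypothetical names) rather than the real data.

```python
import pandas as pd

# Toy stand-ins for df_train and df_test (assumption).
toy_train = pd.DataFrame({'subcategory': [1, 2, 2, 5]})
toy_test = pd.DataFrame({'subcategory': [2, 5, 7]})

train_ids = set(toy_train['subcategory'].unique().tolist())
test_ids = set(toy_test['subcategory'].unique().tolist())

# Ids that appear in only one of the two sets.
only_in_train = train_ids - test_ids
only_in_test = test_ids - train_ids

print('only in train:', sorted(only_in_train))
print('only in test:', sorted(only_in_test))
```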

In [None]:
# Your Code Here

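* Find the union of subcategories from train and test. Assign it to a variable.
* Initialize `LabelEncoder`.
* Fit `LabelEncoder` to it.
* Use `LabelEncoder` to map the initial subcategory ids to integers from 0 to C-1, where C is the number of distinct subcategories. Store the result in the `subcategory_new` column, which is used below.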
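The steps above can be sketched on toy data as follows. The frames `toy_train` and `toy_test` are hypothetical; `np.union1d` is just one of several ways to build the union.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for df_train and df_test (assumption).
toy_train = pd.DataFrame({'subcategory': [10, 30, 30, 50]})
toy_test = pd.DataFrame({'subcategory': [30, 50, 70]})

# Union of ids seen in either set, as a sorted array of unique values.
all_subcategories = np.union1d(toy_train['subcategory'], toy_test['subcategory'])

# Fit on the union so that ids present only in test can still be encoded.
encoder = LabelEncoder()
encoder.fit(all_subcategories)

toy_train['subcategory_new'] = encoder.transform(toy_train['subcategory'])
toy_test['subcategory_new'] = encoder.transform(toy_test['subcategory'])

print(toy_train)
print(toy_test)
```

Fitting on train alone would make `transform` fail on any id that appears only in test, which is why the union is built first.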

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Your Code Here

### Description

The description field is just text. Sometimes it is missing (NA). You should fill missing values with an empty string.
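A short sketch of the filling step on a toy frame (`toy` is a hypothetical name); the same call would apply to the `description` column of both train and test.

```python
import numpy as np
import pandas as pd

# Toy frame with two kinds of missing values (assumption).
toy = pd.DataFrame({'description': ['red bike', np.nan, None, 'old sofa']})

# Replace missing descriptions with an empty string so that a text
# vectorizer receives only str values.
toy['description'] = toy['description'].fillna('')

print(toy)
```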

In [None]:
# Your Code Here

### Train and Test data

In [None]:
X = df_train.loc[:, ['subcategory_new', 'description']].values
y = df_train.loc[:, 'price'].values

X_test = df_test.loc[:, ['subcategory_new', 'description']].values
# Note: a competition test file normally has no `price` column, so no y_test is built here.

# Base pipeline (1 point)

We are going to build a base pipeline, although you may find it not that simple.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDRegressor

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, col_idx):
        self.col_idx = col_idx
    
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[:, self.col_idx]

In [None]:
feature_preproc = FeatureUnion([
    ('cat_preproc', Pipeline(
        [
            ('select', ColumnSelector([0])),
            ('ohe', OneHotEncoder(handle_unknown='ignore'))
        ])),
    ('text_preproc', Pipeline(
        [
            ('select', ColumnSelector(1)),
            ('vect', TfidfVectorizer(min_df=20, max_df=0.9)),
        ]))
])

In [None]:
model = Pipeline([
    ('preproc', feature_preproc),
    ('clf', SGDRegressor(random_state=123, max_iter=50))
])

Describe what is going on in this pipeline.

To understand what `FeatureUnion` is, look [here](http://michelleful.github.io/code-blog/2015/06/20/pipelines/).
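As a hint, the sketch below runs a reduced version of the preprocessing on three toy rows and checks the shape of the result. It repeats the `ColumnSelector` class so that it is self-contained, and it uses `TfidfVectorizer()` with default settings because `min_df=20` would reject such a tiny corpus. `X_toy`, `cat_branch` and `text_branch` are hypothetical names.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick one or several columns of a 2-D array by index."""

    def __init__(self, col_idx):
        self.col_idx = col_idx

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[:, self.col_idx]


# Column 0: encoded subcategory, column 1: description text (toy data).
X_toy = np.array([
    [0, 'red bike'],
    [1, 'old sofa'],
    [0, 'red sofa'],
], dtype=object)

cat_branch = Pipeline([
    ('select', ColumnSelector([0])),   # list index keeps the column 2-D
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
text_branch = Pipeline([
    ('select', ColumnSelector(1)),     # scalar index gives a 1-D array of strings
    ('vect', TfidfVectorizer()),
])

union = FeatureUnion([('cat_preproc', cat_branch), ('text_preproc', text_branch)])

n_cat = cat_branch.fit_transform(X_toy).shape[1]
n_text = text_branch.fit_transform(X_toy).shape[1]
features = union.fit_transform(X_toy)

print(n_cat, n_text, features.shape)
```

Each branch sees the same input rows, and `FeatureUnion` stacks the branch outputs side by side, so the number of columns in `features` is the sum of the columns produced by the branches.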

In [None]:
# Your Code Here

## Training and Preparing submission

Train the model and upload your submission.

In [None]:
%%time
model.fit(X, y)

In [None]:
y_hat = model.predict(X_test)
y_hat = your_inverse_transformation(y_hat)

In [None]:
df_submission = pd.DataFrame(index=df_test.loc[:, 'id'], data=y_hat, columns=['price']).reset_index()

In [None]:
df_submission.to_csv('my_base_submission.csv', sep=',', index=False)