## **Introduction**
This notebook is for anyone who wants to build a model and get it running end to end.
I'm a beginner (and Japanese), so the code below is kept simple and should work as a first step.
The purpose of this notebook is near-minimal modeling; it does **NOT** cover detailed EDA and does **NOT** optimize the model.

1. Read .tsv.7z files 
2. Min Preprocessing
3. Text Processing 
4. Conversion to sparse type
5. Ridge regression

keywords: sparse, tf-idf

## Let's get started!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error

import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### 1. Read .tsv.7z files

In [None]:
!apt-get install p7zip
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/train.tsv.7z
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/test.tsv.7z
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/sample_submission.csv.7z

In [None]:
%%time

train_df = pd.read_table('train.tsv')
test_df = pd.read_table('test.tsv')
print(train_df.shape, test_df.shape)

### 2. Min Preprocessing 

In [None]:
# Define the Root Mean Squared Logarithmic Error (RMSLE).
# Both arguments are expected on the log1p scale, so expm1 restores prices first.
def get_rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))

**RMSLE** is calculated as
$$\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$$
where $n$ is the number of items, $p_i$ is the predicted price and $a_i$ is the actual price.
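
As a small check of the formula, the sketch below computes RMSLE directly and compares it with the plain RMSE of the log1p-transformed values. The prices and the helper name `rmsle` are made up for illustration and are not part of the competition data or of this notebook's pipeline.

```python
import numpy as np

def rmsle(actual, predicted):
    """RMSLE computed directly from the formula, on the original price scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Hypothetical prices, chosen only for illustration.
actual = np.array([10.0, 52.0, 35.0])
predicted = np.array([12.0, 40.0, 35.0])

score_from_prices = rmsle(actual, predicted)

# The same number is the plain RMSE between the log1p-transformed values.
log_actual, log_predicted = np.log1p(actual), np.log1p(predicted)
score_from_logs = np.sqrt(np.mean((log_predicted - log_actual) ** 2))

print(score_from_prices, score_from_logs)
```

This is why a model trained on log1p(price) with a squared-error loss is a natural fit for this metric.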

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
test_df.head()

In [None]:
test_df.info()

In [None]:
# concat Train and Test data
all_data = pd.concat([train_df, test_df], sort=False)

In [None]:
all_data.item_condition_id.unique()

In [None]:
all_data.isnull().sum()

In [None]:
# "item_description" has 4 NaN values; fill them with "No description yet"
all_data["item_description"] = all_data["item_description"].fillna("No description yet")

# "brand_name" has 928,207 NaN values; fill them with "None"
all_data["brand_name"] = all_data["brand_name"].fillna("None")

# "category_name" has 9,385 NaN values; fill them with "Other"
all_data["category_name"] = all_data["category_name"].fillna("Other")

### Text Processing (tf-idf vectorizing)

"name" and "item_description" are free text written by sellers. Their vocabulary is too varied to treat them as categorical variables.

You can use ["TfidfVectorizer"](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to turn each text into a vector of tf-idf weights.
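
To get a feel for what the vectorizer returns, here is a minimal sketch on three made-up item names (they are hypothetical, not taken from the dataset). It only uses the default settings, the same as the cells in this notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item names, made up for this illustration.
names = [
    "red leather wallet",
    "blue leather bag",
    "red cotton shirt",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(names)

print(issparse(X))                     # the result is a scipy sparse matrix
print(X.shape)                         # (number of texts, number of distinct tokens)
print(sorted(vectorizer.vocabulary_))  # the tokens that became columns
```

Each row is one text and each column is one token, so most entries are zero. That is why the result is stored in a sparse format.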

In [None]:
# Fit one vectorizer per text column and transform it into tf-idf features
tfidf_n = TfidfVectorizer()
tfidf_vec_name = tfidf_n.fit_transform(all_data['name'])

tfidf_d = TfidfVectorizer()
tfidf_vec_desc = tfidf_d.fit_transform(all_data['item_description'])

In [None]:
tfidf_vec_name.shape

In [None]:
tfidf_vec_desc.shape

In [None]:
type(tfidf_vec_name)

### 4. Conversion to **sparse** type

The **tf-idf** conversion returns a **csr_matrix**, as shown above.
We need to convert the other columns ("item_condition_id", "category_name", "brand_name", "shipping") to **csr_matrix** as well, using [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).

Before converting them, one-hot encoding is needed.

Below, each column gets its own variable, but you could just as well build them in a **for** loop and collect them in a single variable.
They are kept separate here so that the **hstack** call used later is easy to follow.
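
For reference, the **for** loop variant could look like the sketch below. It runs on a tiny made-up DataFrame (the rows and the names `toy`, `blocks` and `toy_X` are hypothetical), so it is only an illustration of the idea, not the code used in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Hypothetical rows, using the same column names as the real data.
toy = pd.DataFrame({
    "item_condition_id": [1, 3, 1],
    "brand_name": ["Nike", "None", "Apple"],
    "shipping": [0, 1, 1],
})

# One-hot encode each column, convert it to CSR, and collect the pieces.
blocks = []
for col in ["item_condition_id", "brand_name", "shipping"]:
    dummies = pd.get_dummies(toy[col])
    blocks.append(csr_matrix(dummies.to_numpy(dtype="float64")))

# hstack glues the blocks side by side: rows stay aligned, columns add up.
toy_X = hstack(blocks).tocsr()
print(toy_X.shape)
```

Every row ends up with exactly one non-zero entry per encoded column, and the column counts of the blocks simply add up.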

In [None]:
csr_cond = csr_matrix(pd.get_dummies(all_data["item_condition_id"]))
csr_cat_name = csr_matrix(pd.get_dummies(all_data["category_name"]))
csr_brand_name = csr_matrix(pd.get_dummies(all_data["brand_name"]))
csr_ship = csr_matrix(pd.get_dummies(all_data["shipping"]))

In [None]:
type(csr_cond)

In [None]:
# Combining sparse explanatory variables with scipy.sparse.hstack
all_data_X = hstack((tfidf_vec_name, tfidf_vec_desc, csr_brand_name, csr_cat_name, csr_cond, csr_ship)).tocsr()

# "price" is the target variable
all_data_y = all_data["price"]

In [None]:
# split train and test data 
train_X = all_data_X[:len(train_df)]
train_y = all_data_y.iloc[:len(train_df)]  # positional slice; the concatenated index has duplicate labels
test_X = all_data_X[len(train_df):]
print("train_X.shape:", train_X.shape, "\ntest_X.shape:", test_X.shape)

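### **"price"**, target variable **log-transformation**

As good practice, you would check this distribution early in EDA. The plots below compare the raw price with its log1p transform, which is what the model is trained on.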
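
The effect of the transform can also be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for right-skewed prices (an assumption for illustration; it is NOT the competition data) and a hand-written `skewness` helper that is not part of this notebook.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# Synthetic right-skewed "prices" (log-normal); NOT the competition data.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

log_prices = np.log1p(prices)

print("skewness of raw prices:  ", skewness(prices))
print("skewness of log1p prices:", skewness(log_prices))

# expm1 is the inverse of log1p, so predictions can be mapped back to prices.
restored = np.expm1(log_prices)
print(np.allclose(restored, prices))
```

A skewness much closer to zero after the transform means the target is more symmetric, which tends to suit a linear model such as Ridge better.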

In [None]:
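# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(train_df["price"], kde=True)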

In [None]:
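# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(np.log1p(train_df["price"]), kde=True)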

In [None]:
train_y = np.log1p(train_y)

### 5. Ridge regression

In [None]:
# Check the validation score on a 20% hold-out split
X_train, X_val, y_train, y_val = train_test_split(train_X, train_y, test_size=0.2, random_state=2021)

ridge = Ridge()
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_val)
get_rmsle(y_val, y_pred)

### Predict on test data

In [None]:
ridge = Ridge()
ridge.fit(train_X, train_y)
test_pred = np.expm1(ridge.predict(test_X)) # Inverse transformation

In [None]:
sub = pd.read_csv('sample_submission.csv')
sub["price"] = test_pred

sub.to_csv('submission.csv', index=False)

In [None]:
sub.head()