# Introduction
Hello there!
This notebook describes how to implement basic CatBoost regression.

## Exploratory Analysis
To begin, we import everything we need and then take a first look at the data.

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection
import sklearn
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error
import os
import matplotlib.pyplot as plt

The current version of the dataset contains one CSV file:


In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data = pd.read_csv('/kaggle/input/new_data_99_06_03_13_04.csv', delimiter=',')


Let's check the data!

In [None]:
data.shape


In [None]:
data.head(10)

We see some NaN values, so let's check how many there are in each column.

In [None]:
len(data) - data.count()
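The expression above works because `count()` only counts non-missing cells. As a small sketch on made-up data (the toy frame and its values are assumptions, not the real dataset), the same numbers can also be obtained with `isna().sum()`, and `isna().mean()` gives the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, only to illustrate the NaN counting.
toy = pd.DataFrame({
    "Price": [100.0, 250.0, np.nan, 400.0],
    "mileage": [10.0, np.nan, np.nan, 55.0],
    "brand": ["a", "b", "c", None],
})

# Same idea as in the notebook: total rows minus non-missing cells.
missing_by_count = len(toy) - toy.count()

# Equivalent and a bit more explicit: count the missing cells directly.
missing_by_isna = toy.isna().sum()

# Share of missing values per column (0.5 means half of the column is NaN).
missing_share = toy.isna().mean()

print(missing_by_isna)
print(missing_share)
```

The share is usually easier to reason about than the raw count when deciding whether a column is worth keeping.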

Let's also check the number of unique values in each column.

In [None]:
unique_counts = data.nunique()
unique_counts

About 63% of the values in 'Владение' (ownership) are missing, so we are going to drop this column. We also drop the 'description' column, since it contains only free text that could be used for NLP but not in this case. The 'Таможня' (customs) column has only one value, so it is dropped too. Finally, we drop the rows with the remaining 1-7 NaNs per column, which are not very important given the volume of the other data.

In [None]:
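# 'Владение' is dropped here too, as described above; otherwise dropna() would discard every row where it is missing.
data = data.drop(['Владение', 'Таможня', 'description'], axis='columns')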
data = data.dropna()
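The columns to drop were picked by eye here. As a rough sketch of how the same decision could be made programmatically, the example below flags columns that are mostly missing or constant. The toy frame, its column names and the 0.5 threshold are all assumptions for illustration, not values from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values: one mostly-missing column, one constant column.
toy = pd.DataFrame({
    "owner": [np.nan, np.nan, np.nan, "x", "y"],
    "customs": ["ok"] * 5,
    "mileage": [10.0, 20.0, np.nan, 40.0, 50.0],
    "Price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

MAX_MISSING_SHARE = 0.5  # assumed threshold, tune it for your data

# Columns where more than half of the values are missing.
missing_share = toy.isna().mean()
mostly_missing = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()

# Columns with a single distinct value carry no information for the model.
n_unique = toy.nunique()
constant = n_unique[n_unique <= 1].index.tolist()

# Drop the flagged columns first, then the few rows that still contain NaNs.
cleaned = toy.drop(columns=mostly_missing + constant).dropna()

print(mostly_missing, constant, cleaned.shape)
```

Dropping the sparse columns before calling `dropna()` matters: otherwise every row with a NaN in those columns would be thrown away as well.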

Let's check the data for NaNs again and look at the data types of the columns.

In [None]:
len(data) - data.count()


In [None]:
data.dtypes

Now we can see that there are zero NaN values. Many of the columns have the object dtype. For various purposes (for example, to visualize correlations), we need to convert those objects to numbers. In this case we can use scikit-learn's LabelEncoder, which is pretty easy to apply.

In [None]:
le = preprocessing.LabelEncoder()
categorical_columns = data.columns[data.dtypes == 'object']

for column in categorical_columns:
    data[column] = le.fit_transform(list(data[column]))
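The loop above reuses a single encoder, so after it finishes only the mapping of the last column is still available. If you want to turn the codes back into the original labels later, one option is to keep a separate encoder per column. This is a minimal sketch on made-up data (the toy frame and the `encoders` dict are assumptions, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented values.
toy = pd.DataFrame({
    "fuel": ["petrol", "diesel", "petrol", "hybrid"],
    "body": ["sedan", "suv", "suv", "sedan"],
    "Price": [10, 20, 30, 40],
})

encoders = {}          # hypothetical helper: column name -> fitted encoder
encoded = toy.copy()

# Encode every non-numeric column with its own LabelEncoder.
for column in encoded.select_dtypes(exclude="number").columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(encoded[column].tolist())
    encoders[column] = encoder

# Labels are numbered in sorted order: diesel=0, hybrid=1, petrol=2.
fuel_codes = encoded["fuel"].tolist()

# The stored encoder can map the codes back to the original strings.
fuel_decoded = encoders["fuel"].inverse_transform(encoded["fuel"].to_numpy()).tolist()

print(fuel_codes)
print(fuel_decoded)
```

Keep in mind that the integer codes follow alphabetical order of the labels, so their numeric order does not mean anything by itself.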

Let's take another quick look at the data types.

In [None]:
data.dtypes

Great! Now we can implement some correlation visualisation.

In [None]:
fig = plt.figure(figsize=(15,8))
ax1 = fig.add_subplot(111)
plt.imshow(data.corr(), cmap='hot', interpolation='nearest')
plt.colorbar()
labels = data.columns
ax1.set_xticks(np.arange(len(labels)))
ax1.set_yticks(np.arange(len(labels)))
ax1.set_xticklabels(labels,rotation=90, fontsize=10)
ax1.set_yticklabels(labels,fontsize=10)
plt.show()

We can see both negative and positive correlations between price and features such as enginepower and mileage. To tune the prediction model we should drop features with low correlation and maybe generate some new features. But this is a baseline notebook, so we will train the model with all of these features and see what happens.
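The heatmap is nice for an overview, but for feature selection a ranked list is often easier to read. Below is a small sketch on synthetic data (the generated values, the `noise_feature` column and the 0.1 threshold are assumptions for illustration only) that sorts features by the absolute value of their correlation with the target:

```python
import numpy as np
import pandas as pd

# Synthetic data: price grows with engine power and falls with mileage.
rng = np.random.default_rng(0)
n = 2000
enginepower = rng.uniform(60, 300, n)
mileage = rng.uniform(0, 250_000, n)
noise_feature = rng.normal(size=n)  # unrelated to the price by construction
price = 50 * enginepower - 0.02 * mileage + rng.normal(scale=500, size=n)

toy = pd.DataFrame({
    "enginepower": enginepower,
    "mileage": mileage,
    "noise_feature": noise_feature,
    "Price": price,
})

TARGET = "Price"
MIN_ABS_CORR = 0.1  # assumed threshold, not a general rule

# Correlation of every feature with the target (the target itself is removed).
corr_with_target = toy.corr()[TARGET].drop(TARGET)

# Rank by strength, ignoring the sign of the correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)

# Candidates for dropping: features below the chosen threshold.
weak_features = ranked[ranked < MIN_ABS_CORR].index.tolist()

print(ranked)
print(weak_features)
```

This only measures linear relationships, and for label-encoded categorical columns the number is of limited meaning, so treat the list as a hint rather than a final decision.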

# Building the model
Let's build the model and try to predict the price of cars from the features we have.

In [None]:
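predict = 'Price'

# Separate the features from the target column.
# (The positional axis argument of drop() no longer works in recent pandas, so we use columns=.)
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2)

model = CatBoostRegressor(learning_rate=0.5)
model.fit(x_train, y_train)
# For a regressor, score() returns R^2 (coefficient of determination), not a classification accuracy.
r2 = model.score(x_test, y_test)
print('R^2 score of model:', r2)

predictions = model.predict(x_test)
# sklearn metrics expect the true values first, then the predictions.
mae = mean_absolute_error(y_test, predictions)
print("Mean Absolute Error:", mae)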
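A note on the two numbers printed above. As far as the CatBoost documentation describes it, `score` on a regressor reports the coefficient of determination R² rather than an accuracy in the classification sense. To make both metrics concrete, here is a tiny worked example with made-up values (the arrays are assumptions, not model output) that computes them by hand:

```python
import numpy as np

# Invented true values and predictions, just for the arithmetic.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

# Mean Absolute Error: average size of the mistakes, in the units of the target.
mae = np.mean(np.abs(y_true - y_pred))

# R^2 = 1 - SS_res / SS_tot: the share of the target's variance explained by the predictions.
ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of always predicting the mean
r2 = 1.0 - ss_res / ss_tot

print("MAE:", mae)  # 0.375
print("R^2:", r2)   # 0.9375
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price. MAE is handy next to it because it is expressed in the same units as the price.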

## Conclusion
Without any model tuning or feature engineering we got a score of 0.90-0.93 (the R² reported by `model.score`), which is pretty good. You can achieve even better scores with the key steps I mentioned before: feature engineering and tuning the model.
Feel free to comment and fork this notebook, stay safe.