# Car Forecast with Multi-Class Classification
## Araba-Oner 

**Problem:** We have 20 car-brand classes. We cluster them according to the answers people give to 12 questions and return the most suitable car models.

**Project:** Recommend the closest-matching car models to the user in 12 questions.

**Goal:** Derive 5 clusters from the segments. In total, the model has to predict over roughly 400-600 car models and suggest the 10 models that best fit the user.

### **This project was done by the DataRaccoons Team.**
#### **DataRaccoons:** [Web](https://www.dataraccoons.com/) / [Kaggle](https://www.kaggle.com/dataraccoons)

**August 2018 - DataRaccoons**

**Real Time Web Site : [www.arabaoner.com](https://www.arabaoner.com/)**


- **Dictionary For Columns:** 
    - araba-tur : Used/New
    - airbag : Airbag Importance Score (1-5)
    - araba-yas : Car Age
    - araba-performans : High Performance, High Fuel Consumption / Standard Performance, Standard Fuel Consumption
    - araba-kullanım-tur : Long-Term Use / Buy to Resell
    - araba-km : Mileage range of the car
    - araba-yakıt : Fuel Type (Gasoline, LPG, Diesel, No Preference)
    - araba-segment : Car Segment (A, B, C, D, E...)
    - araba-parca : Robust, expensive parts / Low-cost, poor-quality parts
    - arac-hitap : Car Use Type (Family, Commercial, Personal, Off-Road)
    - butce : Car Budget
    - konfor-skorlama : Car Comfort Score (1-5)
    - araba-model : Car Models
    - marka : Car Brands
    - kume : Cluster label

# Introduction
1. [Data Pre-Processing](#ch0)
 - [Cleaning](#ch1)
 - [Mapping](#ch2)
 - [Creating the DataSet for target value](#ch3)
 - [Min Max Scale](#ch4)
 - [PCA(Feature Selection)](#ch5)
       - [Build PCA](#ch6)
       - [PCA Model Scores](#ch7)
2. [Data Visualization](#ch8)
 - [Feature, Brand Correlations](#ch9)
 - [Feature, Brand Graphs](#ch10)
3. [Building Machine Learning Model](#ch11)
4. [Network Analysis](#ch12)
 - [Brands Network Analysis](#ch13)
 - [Features Network Analysis](#ch14)
5. [Clustering with hierarchical clustering](#ch15)
 - [Clustering Data](#ch16)
 - [Clustered Data Correlations](#ch17)

## 1- Data Pre-Processing
<a id="ch0"></a>

**Pre-processing is an important part of building a model. If you get this step wrong, the model either cannot be built or will not be stable.**

### Cleaning
<a id="ch1"></a>

In [None]:
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
#import Dataset
data = pd.read_csv('../input/dataset.csv')
data3 = data.loc[:,['kume']].values

In [None]:
# Rename the columns and drop the ones we do not need
names = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama","araba-model","marka","sahibinden-link","arabam-link","kume"]

names2 = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama","marka","kume"]

data = data.rename(columns=dict(zip(data.columns, names)))

data = data.drop(['sahibinden-link'], axis=1)
data = data.drop(['arabam-link'], axis=1)
data = data.drop(['araba-model'], axis=1)
data.head(5)

### Mapping
<a id="ch2"></a>
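
The cell below encodes every answer with a dictionary through `Series.map`. One thing worth checking with this approach is that any answer missing from the dictionary silently becomes `NaN`. Here is a minimal sketch on toy data; the `answers` values, the `"Hibrit"` entry, and the name `mapping_yakit_demo` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy answers; "Hibrit" is a made-up value that the dictionary does not cover
answers = pd.Series(["LPG", "Dizel", "Hibrit", "Benzinli"])
mapping_yakit_demo = {"LPG": 0, "Dizel": 1, "Benzinli": 2, "Farketmez": 3}

encoded = answers.map(mapping_yakit_demo)

# Values missing from the dictionary come back as NaN
unmapped = answers[encoded.isna()].unique().tolist()
print(encoded.tolist())
print("Unmapped values:", unmapped)
```

Running a check like this after each `map` call makes typos in the dictionary keys visible before they reach the model.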

In [None]:
#Creating Mapping Dictionaries
mapping_tur = {"2. El":0, "Sıfır":1}
mapping_yas = {"0-1":1, "1-3":2, "3-5":3, "5-8":4, "8-10":5, "10-12":6, "12+":7}
mapping_performans = {"Vasat Performans, Az Yakması":0, "Yüksek Performans, Çok Yakması":1, "Standart Performans, Az Yakması":0}
mapping_kullanım = {"Uzun Süre Binme Odaklı":0, "Satıp Para Kazanma Odaklı":1}
mapping_km = {"0-25.000":1, "0-25.002":1, "25.000-50.000":2, "50.000-100.000":3, "100.000-200.000":4, "200.000+":5}
mapping_yakıt = {"LPG":0, "Dizel":1, "Benzinli":2, "Farketmez":3}
mapping_segment = {"A Segmenti (Ekonomik Az Yakanlar, i10)":0, "B Segmenti (Hyundai Getz, Polo)":1, "C Segmenti (Honda Civic, Renault Fluence)":2,
                  "D Segmenti (Mercedes C Serisi, VW Passat, Ford Mondeo)":3, "E Segmenti (BMW 5 serisi, Volvo s80)":4, "F Segmenti (Audi A8, BMW 7 serisi)":5,
                  "G Segmenti (Porshce 911)":6, "J Segmenti (4x4 Jipler vs.)":7, "D Segmenti (Mercedes C Serisi, VW Passat)":3}
mapping_parca = {"Sürekli Sorun Çıkarsın Ucuz Parçaları Olsun":0, "Parçalar Sağlam ve Pahalı Olsun, Az Sorun Çıkarsın.":1, "Arada Bir Sorun Çıkarsın, Ucuz Parçaları Olsun":0}
mapping_hitap = {"Aile Aracı":0, "Ticari":1, "Şahıs Aracı":2, "Off Road":3}
mapping_butce = {"0-15.000":0, "15.000-25.000":1, "25.000-35.000":2, "35.000-45.000":3, "45.000-55.000":4, "55.000-65.000":5,
                "65.000-75.000":6, "75.000-85.000":7, "85.000-100.000":8, "100.000-200.000":9, "200.000+":10}
mapping_marka = {"Alfa Romeo":0, "Audi":1, "Bmw":2, "Chevrolet":3, "Citroen":4, "Dacia":5, "Fiat":6, "Ford":7, "Honda":8,
                "Hyundai":9, "Kia":10, "Mercedes":11, "Mitsubishi":12, "Nissan":13, "Opel":14, "Peugeot":15, "Porsche":16,
                "Renault":17, "Toyota":18, "Volkswagen":19, "Volvo":20, "Skoda":21, "Mazda":22, "Mini":23, "Land Rover":24,
                "Seat":25}

#Mapping on Data
data['araba-tur'] = data['araba-tur'].map(mapping_tur)
data['araba-yas'] = data['araba-yas'].map(mapping_yas)
data['araba-performans'] = data['araba-performans'].map(mapping_performans)
data['araba-kullanım-tur'] = data['araba-kullanım-tur'].map(mapping_kullanım)
data['araba-km'] = data['araba-km'].map(mapping_km)
data['butce'] = data['butce'].map(mapping_butce)
data['araba-yakıt'] = data['araba-yakıt'].map(mapping_yakıt)
data['araba-segment'] = data['araba-segment'].map(mapping_segment)
data['araba-parca'] = data['araba-parca'].map(mapping_parca)
data['arac-hitap'] = data['arac-hitap'].map(mapping_hitap)
data['marka'] = data['marka'].map(mapping_marka)

data8=data
data9=data
data11 = data['marka']
data.head(5)

### Creating Dataset for Target Value
<a id="ch3"></a>

In [None]:
# Helper functions for building a per-brand dataset

def plotData(data, marka=None):
    """Plot one pie chart per question, optionally for a single brand."""
    if marka is not None:
        data = data[(data.marka == mapping_marka[marka])]
        print("Opinion of ", marka)
    fig = plt.figure(figsize=(25,15))

    # The 12 question columns (brand and cluster are left out)
    question_cols = names2[:12]

    for i, col in enumerate(question_cols):
        fig.add_subplot(3, 4, i + 1)  # subplot indices start at 1
        data[col].value_counts().plot(kind='pie', autopct='%.1f%%')
        plt.ylabel(" ", fontsize=15)
        plt.title(col)  # the column name serves as the title
    # Fall back to a generic file name when no brand is given
    out_name = marka if marka is not None else "all-brands"
    plt.savefig(out_name + ".png")
    plt.savefig(out_name + ".pdf")

def getOpinion(data, marka=None):
    """Return the mean of every column, optionally for a single brand."""
    if marka is not None:
        data = data[(data.marka == mapping_marka[marka])]
    return [data[col].mean() for col in names2]

opinions = dict()
for k in mapping_marka.keys():
    opinions[k] = getOpinion(data, marka = k)

df = pd.DataFrame.from_dict(opinions)
df.rename(index = dict(zip(range(len(names2[0:])),names2[0:])),inplace=True)
df = df.reindex(columns=['Alfa Romeo', 'Audi', 'Bmw', 'Chevrolet', 'Citroen', 'Dacia', 'Fiat', 'Ford', 'Honda', 'Hyundai',
                        'Kia', 'Mercedes', 'Mitsubishi', 'Nissan', 'Opel', 'Peugeot', 'Renault', 'Toyota',
                        'Volkswagen', 'Volvo'])
df.T

### Min-Max Scale
**I will now scale the features to the [0, 1] range so that the car properties are easier to compare in the correlations between cars.**
<a id="ch4"></a>
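
As a quick illustration of what min-max scaling does, each value x is replaced by (x - min) / (max - min) of its column. The sketch below uses a made-up toy column (`col` is not part of the dataset) and checks the manual formula against `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of encoded answers (values are made up)
col = np.array([[0.0], [2.0], [5.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# The same transformation done by scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(col)

print(manual.ravel())
print(np.allclose(manual, scaled))
```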

In [None]:
from sklearn.preprocessing import MinMaxScaler

data9 = data9.drop(['kume'], axis=1)
data9 = data9.drop(['marka'], axis=1)

scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(data9)
data9 = scaler.transform(data9)
data9 = pd.DataFrame(data9)

data9 = pd.concat([data9, data11], axis = 1)

names = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama","marka"]

data9 = data9.rename(columns=dict(zip(data9.columns, names)))
data9.head(8)

In [None]:
names5 = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama","marka"]

names4 = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama","marka"]

def getOpinion2(data9, marka = None):
    if marka != None: 
        data9 = data9[(data9.marka == mapping_marka[marka])]
    return [data9[col].mean() for col in names5[0:]]

In [None]:
opinions = dict()
for k in mapping_marka.keys():
    opinions[k] = getOpinion2(data9, marka = k)

df2 = pd.DataFrame.from_dict(opinions)
df2.rename(index = dict(zip(range(len(names4[0:])),names4[0:])),inplace=True)
df2 = df2.reindex(columns=['Alfa Romeo', 'Audi', 'Bmw', 'Chevrolet', 'Citroen', 'Dacia', 'Fiat', 'Ford', 'Honda', 'Hyundai',
                        'Kia', 'Mercedes', 'Mitsubishi', 'Nissan', 'Opel', 'Peugeot', 'Renault', 'Toyota',
                        'Volkswagen', 'Volvo'])
print("")
df2.T

## PCA (Feature Selection)
<a id="ch5"></a>
**What is PCA?**

- PCA (principal component analysis) is a statistical technique used in recognition, classification, and image compression. It reduces the number of dimensions and compresses the data by finding the general properties of high-dimensional data. The basic idea is to represent multidimensional data with fewer variables that capture its main features. Some information is lost in the reduction, but the discarded components are intended to carry little information about the population. It is commonly used in face recognition and image compression.


* #### **As you know, high accuracy depends on choosing the right features. So we will reduce the features with PCA and see what the results are.**
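
To make the idea of "fewer variables that keep most of the information" concrete, here is a minimal sketch on synthetic data. The array `X_demo` is randomly generated from two hidden factors and only stands in for the survey answers; it is an assumption for illustration, not the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey: 200 rows, 12 features driven by 2 hidden factors
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 12))
X_demo = latent @ loadings + 0.1 * rng.normal(size=(200, 12))

# Standardize, then project onto the first two principal components
X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_std)
components = pca_demo.transform(X_std)

# Fraction of the total variance kept by each component
print(pca_demo.explained_variance_ratio_)
print("Total kept:", pca_demo.explained_variance_ratio_.sum())
```

Looking at `explained_variance_ratio_` is a simple way to judge how much information two components retain before training a classifier on them.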

### Build PCA
<a id="ch6"></a>

In [None]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

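# Use only the 12 question columns as PCA input. Including 'kume' (the target)
# or 'marka' would leak label information into the principal components.
features = ["araba-tur","airbag","araba-yas","araba-performans","araba-kullanım-tur","araba-km",
         "araba-yakıt","araba-segment","araba-parca","arac-hitap",
         "butce","konfor-skorlama"]

df2 = data8
x = df2.loc[:, features].values
y = df2.loc[:,['kume']].values
# Standardize the features so each one contributes equally to the components
x = StandardScaler().fit_transform(x)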

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])


In [None]:
finalDf = pd.concat([principalDf, df2[['kume']]], axis = 1)
finalDf.head(8)

### PCA Model Scores
<a id="ch7"></a>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier

predictors = finalDf.drop(["kume"], axis=1)
target = finalDf["kume"]
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.25, random_state = 0)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neural_network import MLPClassifier

models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Naive Bayes', GaussianNB()))
models.append(('Decision Tree (CART)',DecisionTreeClassifier())) 
models.append(('K-NN', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('BaggingClassifier', BaggingClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))

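# Fit each model and report its accuracy on the held-out test set
for name, model in models:
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score is already imported above, so no import is needed inside the loop
    print("%s -> ACC: %%%.2f" % (name, accuracy_score(y_test, y_pred)*100))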

## 2- Data Visualization
<a id="ch8"></a>

### Correlations
<a id="ch9"></a>

In [None]:
#Lets Have a Look At Correlations
fig, ax = plt.subplots()
fig.set_size_inches(15,15)
sns.heatmap(data.corr(),cbar=True, annot=True, square=True, annot_kws={'size': 12})
plt.tight_layout()
plt.savefig('2-elaraba-corr.png')

In [None]:
corr = df.corr()
fig, ax = plt.subplots()
fig.set_size_inches(20,20)
mask = np.zeros_like(corr, dtype=bool) # if the heatmap is broken or does not show the brands, delete this line
mask[np.triu_indices_from(mask)] = True # and this one, run the cell (Shift+Enter), then paste them back and run it again
sns.heatmap(corr, mask=mask, cbar=True, annot=True, square=True, annot_kws={'size': 10})
plt.savefig('car-corr.png')

### Graphs
<a id="ch10"></a>

In [None]:
# Use the 12 question columns (`questions` is only built later, in section 3)
qs = names2[:12]
qf = df.loc[qs]

fig, ax = plt.subplots(figsize=(20,6))
qf[['Alfa Romeo','Audi','Bmw', 'Chevrolet', 'Citroen', 'Dacia', 'Fiat', 'Ford', 'Honda', 'Hyundai',
                        'Kia', 'Mercedes', 'Mitsubishi', 'Nissan', 'Opel', 'Peugeot', 'Renault', 'Toyota',
                        'Volkswagen', 'Volvo']].plot(ax=ax,alpha=0.75, rot=80)
# Set the x-ticks after plotting so that every feature gets exactly one label
ax.set_xticks(range(len(qs)))
ax.set_xticklabels(qs)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.grid()
plt.savefig('compare.pdf')

In [None]:
# Drop the brand and cluster rows so that only the 12 question features remain
df = df.drop(['marka', 'kume'])

# Plot data
fig, ax = plt.subplots(figsize=(30,10))
df.plot(ax=ax)
# One tick per remaining feature; the number of ticks must match the number of labels
ax.set_xticks(range(len(df.index)))
ax.set_xticklabels(df.index)
plt.grid()

## 3- Building Machine Learning Models
<a id="ch11"></a>

In [None]:
#Split Data By Train and Test
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier

data = data.drop(['marka'], axis=1)
predictors = data.drop(["kume"], axis=1)
target = data["kume"]
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.25, random_state = 0)

In [None]:
# Compare several classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neural_network import MLPClassifier

models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Naive Bayes', GaussianNB()))
models.append(('Decision Tree (CART)',DecisionTreeClassifier())) 
models.append(('K-NN', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('BaggingClassifier', BaggingClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))

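# Fit each model and report its accuracy on the held-out test set
for name, model in models:
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score is already imported above, so no import is needed inside the loop
    print("%s -> ACC: %%%.2f" % (name, accuracy_score(y_test, y_pred)*100))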

In [None]:
# Let's look at feature importance
rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=2)
rf.fit(X_train, y_train)
# Pair each training column with its importance score
questions = pd.DataFrame({'features': X_train.columns,'importance': rf.feature_importances_})
questions = questions.sort_values(by='importance', ascending=False)
questions

## 4- Network Analysis
<a id="ch12"></a>

### Brand Network Analysis
<a id="ch13"></a>

In [None]:
import networkx as nx

# Convert the DataFrame to a NumPy array, which is easier to turn into a graph with networkx
cor_matrix = corr.values

# Create the graph from the correlation matrix
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it)
G = nx.from_numpy_array(cor_matrix)

# Relabel the nodes to match the brand names
G = nx.relabel_nodes(G,lambda x: df.columns[x])

In [None]:
def drawGraph(G, size = 20):
    fig, ax = plt.subplots()
    fig.set_size_inches(size,size)
    
    pos_fr = nx.fruchterman_reingold_layout(G)
    edges = G.edges()

    weights = [G[u][v]['weight'] for u,v in edges]
    labels = {e: round(G[e[0]][e[1]]['weight'],2) for e in edges}
    weights2 = [w**2 for w in weights]

    nx.draw(G, pos=pos_fr, node_size=1000, node_color='lightblue', with_labels=True)

    # Plot edge labels
    nx.draw_networkx_edge_labels(G, pos=pos_fr, edge_labels=labels)
    plt.savefig('graph.pdf')
    
drawGraph(G)

In [None]:
# remove edges with correlation < 0.5
G.remove_edges_from([(u,v) for u,v,e in G.edges(data = True) if e['weight'] < 0.5])
drawGraph(G, size =30)

In [None]:
# remove edges with correlation < 0.8
G.remove_edges_from([(u,v) for u,v,e in G.edges(data = True) if e['weight'] < 0.8])
drawGraph(G, size=30)

In [None]:
# remove edges with correlation < 0.9
G.remove_edges_from([(u,v) for u,v,e in G.edges(data = True) if e['weight'] < 0.9])
drawGraph(G, size=30)

### Feature Network Analysis
<a id="ch14"></a>

In [None]:
# Convert the DataFrame to a NumPy array, which is easier to turn into a graph with networkx
cor_matrix = df.T.corr().values

# Create the graph from the correlation matrix
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it)
G = nx.from_numpy_array(cor_matrix)

# Relabel the nodes to match the feature names
G = nx.relabel_nodes(G,lambda x: df.T.columns[x])

In [None]:
drawGraph(G, size = 25)

## 5- Clustering with hierarchical clustering
* **I clustered the data by inspecting the correlations rather than coding the clusters by hand. The hierarchical clustering method shown below helped with this.**
<a id="ch15"></a>
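
Before the full dendrograms, here is a minimal sketch of how a hierarchical clustering can be cut into a fixed number of flat clusters. The `profiles` array is made-up toy data, and using `fcluster` to cut the tree is an assumption added for illustration; the notebook itself only draws the dendrograms.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy brand profiles: rows are brands, columns are averaged answers (made up)
profiles = np.array([
    [0.10, 0.20],
    [0.15, 0.25],
    [0.80, 0.90],
    [0.85, 0.95],
    [0.50, 0.10],
])

dist = pdist(profiles)                 # condensed pairwise Euclidean distances
link = linkage(dist, method="single")  # single linkage, as in the cells below

# Cut the tree into at most 3 flat clusters
cluster_ids = fcluster(link, t=3, criterion="maxclust")
print(cluster_ids)
```

The same call with `t=5` on the brand profiles would be one way to obtain five clusters from the tree.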

### Clustering Data
<a id="ch16"></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram


# Rows are brands, columns are the averaged question features
data_array = df.transpose()
data_array = np.array(data_array)

In [None]:
data_dist = pdist(data_array) # computing the distance
data_link = linkage(data_dist)

In [None]:
# A plain NumPy array has no dtype.names, so take the brand labels from df instead
dendrogram(data_link, labels=df.columns.tolist())
plt.xlabel('Car Brands')
plt.ylabel('Distance')
plt.suptitle('Samples clustering', fontweight='bold', fontsize=14);

### Clustered Data Correlations
<a id="ch17"></a>

In [None]:
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(12,12))
# add_axes takes [x, y, width, height] in figure coordinates
ax1 = fig.add_axes([0.05,0.1,0.2,0.6])
Y = linkage(data_dist, method='single')
# A plain NumPy array has no dtype.names, so take the brand labels from df instead
Z1 = dendrogram(Y, orientation='right', labels=df.columns.tolist())
ax1.set_xticks([])


# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Z2 = dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])

#Compute and plot the heatmap
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = squareform(data_dist)
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower',cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)

## Sources
* [SciPy Hierarchical Clustering and Dendrogram Tutorial](https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/)
* [PCA using Python (scikit-learn)](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)
* [Uzay Cetin's Network Analysis](http://github.com/uzay00)


**Thanks for reading. Don't forget, your comments are worth gold to me.**