**1. Define the Problem**

This kernel uses the avocado dataset to try out some beginner-friendly ML models. The idea is to see whether the type, the region, or the price can be predicted from the given data with simple models. Over time, I might add a time-series analysis to the kernel.

This kernel is mainly a project for me to check what I already learned. The best way to learn something is to explain it to someone else ;) 

The procedure is based on the tutorial by https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy



**2. Preparation**

**2.1. Get the Data**

First I import some important libraries such as pandas and NumPy. Then I load the dataset into a pandas DataFrame.
You can immediately see that the date cannot be used in this format, so I split it into "Date_Day", "Date_Month" and "Date_Year". Because there is already a "year" column, I check whether it is identical to the new "Date_Year" column and then drop it.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../input/avocado.csv")
df.head()

In [None]:
# Split Date (format YYYY-MM-DD) into year, month and day
date = df["Date"].str.split("-", expand=True).astype(int)
df["Date_Year"] = date[0]
df["Date_Month"] = date[1]
df["Date_Day"] = date[2]
df.drop("Date", axis=1, inplace=True)

# Check that year and Date_Year match before dropping the redundant column
if not (df["Date_Year"] == df["year"]).all():
    print("No match")
df.drop("year", axis=1, inplace=True)

df.head()
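
As an alternative to splitting the string by hand, the date can also be parsed with `pd.to_datetime` and read through the `.dt` accessor. This is only a small sketch on a made-up two-row frame (the `toy` frame and its values are assumptions, not the real data):

```python
import pandas as pd

# Toy frame standing in for the avocado data (values are made up)
toy = pd.DataFrame({"Date": ["2015-12-27", "2016-01-03"], "year": [2015, 2016]})

# Parse once, then read the components through the .dt accessor
parsed = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["Date_Year"] = parsed.dt.year
toy["Date_Month"] = parsed.dt.month
toy["Date_Day"] = parsed.dt.day

# A single boolean answers whether the two year columns agree everywhere
years_match = (toy["Date_Year"] == toy["year"]).all()
print(years_match)
```

Keeping the parsed datetime around would also make it easy to derive other features later, such as the week of the year.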

**2.2. Get an Overview**

Now let's get to know the data. First we compare the maximum of the "Unnamed: 0" column with the maximum of the index to see whether the column simply mirrors the index. It doesn't, so I drop it.

With roughly 18,000 entries it is a rather small dataset. We should keep this in mind for the next steps.

In [None]:
#Get an Overview
print(df["Unnamed: 0"].max())
print(df.index.max())

df.drop("Unnamed: 0", axis=1, inplace = True)


Now I look at the values of the nominal variables and print summary statistics for the numeric ones.

First, I notice that there are many different regions. It could be helpful to reduce the number of categories of this variable. For example, we could group the regions into North, West, South and East.

The statistics show that some variables have a very large range between their Min and Max values, so we should at least check these variables for outliers. In addition, the distribution of several variables is skewed, and we should take a closer look at them.


In [None]:
print(df.type.unique())

In [None]:
print(df.region.unique())
#Maybe cluster to N, W, S, E

In [None]:
df.describe()
#Check for Outliers (Min/Max)
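
The idea of grouping regions into N, W, S, E could be sketched with a plain dictionary and `Series.map`. The mapping below is hypothetical and deliberately incomplete; it only covers the four regions in the toy frame and is not used anywhere else in this kernel:

```python
import pandas as pd

# Hypothetical, incomplete mapping; a real one would need an entry for every region
region_to_area = {
    "Albany": "East",
    "Atlanta": "South",
    "Chicago": "North",
    "LosAngeles": "West",
}

toy = pd.DataFrame({"region": ["Albany", "LosAngeles", "Atlanta", "Chicago", "Albany"]})
# Regions missing from the dictionary become NaN, which makes gaps easy to spot
toy["area"] = toy["region"].map(region_to_area)
print(toy["area"].value_counts())
```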

**2.3. 4C's: Completing, Correcting, Creating, and Converting**

**2.3.1. Checking for Missing Values**

Now we have a rough overview of the individual variables. In the next step I check the data for missing values. There are none, which is why the "Completing" step is done quickly.

In [None]:
print('Columns with NAN:\n', df.isnull().sum())

**2.3.2. Looking for Outliers**

In this step, we look for univariate outliers.
Some outliers appear on the far right of the box plots. The scatter plots also show some points far away from the other values. We need to check this further.

In [None]:
sns.boxplot(x=df['AveragePrice'])

In [None]:
sns.boxplot(x=df['Total Volume'])

In [None]:
sns.boxplot(x=df['4046'])

In [None]:
sns.boxplot(x=df['4225'])

In [None]:
sns.boxplot(x=df['4770'])

In [None]:
sns.boxplot(x=df['Total Bags'])

In [None]:
sns.boxplot(x=df['Small Bags'])

In [None]:
sns.boxplot(x=df['Large Bags'])

In [None]:
sns.boxplot(x=df['XLarge Bags'])

In [None]:
sns.boxplot(x=df['Date_Year'])

In [None]:
sns.boxplot(x=df['Date_Month'])

In [None]:
sns.boxplot(x=df['Date_Day'])

In [None]:
sns.pairplot(df)

Now I calculate the z-score of every numeric value and use a threshold of 3. Any row that contains a value outside the range [-3, 3] is treated as an outlier and removed from the dataset.

Since the quartiles of some variables are zero, I will not use the IQR; otherwise too many rows would be removed.

You can now check the boxplots again and you will see that the most extreme outliers have disappeared.

In [None]:
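# Delete outliers with the z-score
from scipy import stats

not_to_check = ["type", "region"]

# Absolute z-score of every numeric cell
z = np.abs(stats.zscore(df.drop(not_to_check, axis=1)))
threshold = 3
# np.where returns one entry per outlying cell; np.unique collapses them to distinct rows
row_to_drop = np.unique(np.where(z > threshold)[0])
df = df.drop(df.index[row_to_drop])

print(len(row_to_drop), "rows with outliers removed")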
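
To illustrate why the IQR rule is a poor fit when the quartiles are zero, here is a small sketch on a made-up column (80 zeros and the values 1 to 20; these numbers are an assumption and not taken from the avocado data). With Q1 = Q3 = 0 the IQR fences collapse to zero, so every positive value is flagged, while the z-score rule with the threshold of 3 used above only flags the largest ones:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up column: 80 zeros and 20 positive values, so Q1 = Q3 = 0
values = pd.Series([0.0] * 80 + list(range(1, 21)), dtype=float)

# IQR rule: flag everything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule with a threshold of 3
z = np.abs(stats.zscore(values))
z_outliers = z > 3

print("IQR rule flags:", int(iqr_outliers.sum()))
print("Z-score rule flags:", int(z_outliers.sum()))
```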

**2.4. Explore the Data**

Now that the data has been cleaned up, we take a closer look at the individual variables to see what we can do with them.

As you can see there is a lot of skewness. 

In [None]:
sns.pairplot(df)

In [None]:
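# Heatmap of pairwise correlations (numeric columns only, since "type" and "region" are strings)
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns)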

To reduce the skewness we log-transform the variables in the hope of getting closer to a normal distribution.

You can also see a strong correlation between "Total Bags" and "Small Bags", which is not surprising because "Total Bags" is the sum of the individual bag columns.

In [None]:
input_list = ["Total Volume", "4046", "4225", "4770", "Total Bags", "Small Bags", "Large Bags", "XLarge Bags"]

def log_var(input_var):
    df[input_var] = np.log(1 + df[input_var])

log_var(input_list)

df.head()

In [None]:
sns.pairplot(df)
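
Besides looking at the pairplot, the effect of the log transform can also be expressed as a number with `Series.skew()`. The sketch below uses synthetic log-normal draws as a stand-in for a volume column (the data and the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed volumes (log-normal draws); not the avocado data
rng = np.random.default_rng(0)
volume = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000))

skew_before = volume.skew()
# np.log1p(x) computes log(1 + x), the same transform as log_var above
skew_after = np.log1p(volume).skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, so the same check on the real columns would show how much the transform helped.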

**3. Preprocessing**

In this section I will show some approaches to data preparation. However, not every step is suitable, depending on what you want to test. I try to note when you should not use the step. For example, it's not useful to divide the region into dummies if you want to predict them later as dependent variables.

**3.1. Encode Nominal Variables**

As a first step we encode the type of cultivation ('type') numerically. No information is lost by doing so, which is why we perform this step first.

We also drop the day because, in a simple model, I assume that the day has no influence if you look at the whole year. With a more advanced model, it might be interesting to look at the days of the week. Maybe more avocados are bought on Mondays.

In [None]:
# Encode the two cultivation types as 0/1
df["type"] = df["type"].map({"conventional": 0, "organic": 1})
df.drop("Date_Day", axis=1, inplace=True)
df.head()

**3.2. Create Dummies**

For the variable 'region' we use dummies to better pass them to the model. We could also code the variable numerically as we did before. 

If you use your model to predict the region, you should not use dummy variables. Therefore we make a copy of the dataframe to use it later without dummies for predicting the region.

In [None]:
# Keep an independent copy without dummies for predicting the region later
df_region = df.copy()
df = pd.get_dummies(df, columns = ["region"])

df.head()
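
The difference between dummies and a plain numeric code can be seen on a tiny example. The three rows below are made up for illustration (`toy` and `region_code` are hypothetical names):

```python
import pandas as pd

# Three made-up rows; the region names are only illustrative
toy = pd.DataFrame({"region": ["Albany", "Boston", "Albany"], "AveragePrice": [1.3, 1.5, 1.1]})

# One-hot encoding: one indicator column per region, no implied ordering
dummies = pd.get_dummies(toy, columns=["region"])
print(dummies.columns.tolist())

# Integer codes: a single column, but the numbers suggest an order that does not exist
toy["region_code"] = toy["region"].astype("category").cat.codes
print(toy["region_code"].tolist())
```

Dummies cost one column per category, which is why the dataframe gets so wide here, but they avoid telling the model that one region is "bigger" than another.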

**3.3. Group Data**

A third approach to dealing with data is to group it. We use a very simple approach and divide the price into high and low in relation to the mean. This is necessary if, for example, we want to determine the price by a logistic regression. Therefore we make a copy of the dataframe to use it later.

In [None]:
# Work on a real copy so the new column does not leak into df
df_price = df.copy()

mean = df_price["AveragePrice"].mean()

# 1 = above the mean price, 0 = at or below it (no row is left unlabeled)
df_price["Price"] = (df_price["AveragePrice"] > mean).astype(int)

df_price = df_price.drop("AveragePrice", axis=1)

**4. Model**

**4.1. Logistic Regression**

Now we can fit a first model. To begin, I use a logistic regression to predict the type of cultivation. To do this we drop "type" from the features x and use it as the target y. Then we split the dataset into a training and a test part, fit the model, and print the accuracy on both parts to get a first impression of the model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x = df.drop(["type"], axis=1)
y = df["type"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
print(model.score(x_train, y_train))

The scores on both sets are very similar and very high. But what if there are very few organically grown avocados in the dataset and that is the only reason the score looks so good? Imagine there were only 1 'organic' in a set of 100 avocados. Then always guessing 'conventional' would be 99% accurate. Let's check that out.

Take a look at the ROC curve. It forms an almost perfect right angle, which means a very high true positive rate with hardly any false positives. The precision-recall curve shows both high precision and high recall. The two classes are also reasonably balanced, so we can say that our model works very well for this variable.


In [None]:
df["type"].value_counts()
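
The 1-in-100 thought experiment can be reproduced with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels below are made up to match the example; they are not the real class counts:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels matching the thought experiment: 99 conventional (0), 1 organic (1)
y_toy = np.array([0] * 99 + [1])
x_toy = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(x_toy, y_toy)
baseline_accuracy = baseline.score(x_toy, y_toy)
print(baseline_accuracy)
```

Any real model should clearly beat such a baseline before we trust its accuracy.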

In [None]:
from sklearn.metrics import roc_curve, auc, precision_recall_curve

#ROC Curve
fpr_model, tpr_model, thresholds_model = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.plot(fpr_model, tpr_model)
plt.xlabel("P(FP)")
plt.ylabel("P(TP)")

In [None]:
#Recall Curve
precision_model, recall_model, thresholds_model = precision_recall_curve(y_test, model.predict_proba(x_test)[:,1])
plt.plot(precision_model, recall_model)
plt.xlabel("Precision")
plt.ylabel("Recall")
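
Both curves can also be summarized in a single number each: the area under the ROC curve and the average precision. This is a sketch on synthetic data from `make_classification`, used here only as a stand-in for the type prediction:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the type prediction
x_syn, y_syn = make_classification(n_samples=1000, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_syn, y_syn, random_state=0)

clf = LogisticRegression(solver="liblinear").fit(x_tr, y_tr)
proba = clf.predict_proba(x_te)[:, 1]

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
roc_auc = roc_auc_score(y_te, proba)
# Average precision summarizes the precision-recall curve
avg_precision = average_precision_score(y_te, proba)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
```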

**4.2. KNN & RandomForest**

**4.2.1. Model**

That was easy. Let's have a look at a slightly more difficult prediction. Now we want to predict the region.

To do this, we first use logistic regression. As you can see, this does not produce good results in this case.
So we try some other models: KNN as a nearest-neighbor classifier and Random Forest as a tree-based ensemble.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x = df_region.drop(["region"], axis=1)
y = df_region["region"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
print(model.score(x_test, y_test))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x = df_region.drop(["region"], axis=1)
y = df_region["region"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(x_train, y_train)

print(knn.score(x_test, y_test))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x = df_region.drop(["region"], axis=1)
y = df_region["region"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

rfc = RandomForestClassifier(criterion = "entropy", n_estimators = 300)
rfc.fit(x_train, y_train)

print(rfc.score(x_test, y_test))

**4.2.2. Cross-Validation**

Now we already have significantly better results. If you play with the parameters, you will see that the results can get better or worse.

Let's check whether the results are similar with cross-validation. I skip cross-validation for the RFC: each tree is trained on a bootstrap sample, so the rows a tree did not see (the out-of-bag rows) can already serve as a built-in validation set.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

x = df_region.drop(["region"], axis=1)
y = df_region["region"]

scores = cross_val_score(LogisticRegression(solver="liblinear"), x, y, cv = RepeatedKFold(n_repeats = 2))

print(np.mean(scores))

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

x = df_region.drop(["region"], axis=1)
y = df_region["region"]

# cross_val_score clones and fits the estimator on each fold, so no manual fit is needed
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, x, y, cv = RepeatedKFold(n_repeats = 2))

print(np.mean(scores))
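
The remark above that the random forest does not need a separate cross-validation can be made concrete with its out-of-bag score. Note that scikit-learn only computes it when `oob_score=True` is set, which the cells above do not do. The sketch uses synthetic data with four classes as an assumption; the real region target has many more classes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic multi-class data; the real region target has many more classes
x_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# oob_score=True scores each row using only the trees that did not see it
# in their bootstrap sample
forest = RandomForestClassifier(
    criterion="entropy", n_estimators=300, oob_score=True, random_state=0
)
forest.fit(x_syn, y_syn)
oob_accuracy = forest.oob_score_
print(oob_accuracy)
```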

**4.3. The whole model**

Now we want to apply all the methods shown so far together to the price variable and try to predict whether the price is low or high.

First let's test some models.

In [None]:
df_price.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
print(model.score(x_train, y_train))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(x_train, y_train)

print(knn.score(x_test, y_test))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

rfc = RandomForestClassifier(criterion = "entropy", n_estimators = 300)
rfc.fit(x_train, y_train)

print(rfc.score(x_test, y_test))

**4.3.2. Validate**

The cross-validation and the curves look pretty good for KNN and RFC, so we can assume that our model works well. Of course, the accuracy could still be improved a bit.

In [None]:
from sklearn.model_selection import cross_val_score

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

scores = cross_val_score(LogisticRegression(solver="liblinear"), x, y, cv = RepeatedKFold(n_repeats = 2))

print(np.mean(scores))

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(x,y)
scores = cross_val_score(knn, x, y, cv = RepeatedKFold(n_repeats = 2))

print(np.mean(scores))

In [None]:
df_price["Price"].value_counts()

In [None]:
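#ROC Curve
# Refit all three models on one shared split so that every curve is scored on unseen rows
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LogisticRegression(solver="liblinear").fit(x_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
rfc = RandomForestClassifier(criterion = "entropy", n_estimators = 300).fit(x_train, y_train)

fpr_model, tpr_model, thresholds_model = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.plot(fpr_model, tpr_model, label = "LogisticRegression")

fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, knn.predict_proba(x_test)[:,1])
plt.plot(fpr_knn, tpr_knn, label = "KNN")

fpr_rfc, tpr_rfc, thresholds_rfc = roc_curve(y_test, rfc.predict_proba(x_test)[:,1])
plt.plot(fpr_rfc, tpr_rfc, label = "RFC")
plt.xlabel("P(FP)")
plt.ylabel("P(TP)")
plt.legend(loc = "best")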

In [None]:
#Precision-Recall Curve (precision on the x-axis, recall on the y-axis)
precision_model, recall_model, thresholds_model = precision_recall_curve(y_test, model.predict_proba(x_test)[:,1])
plt.plot(precision_model, recall_model, label = "LogisticRegression")

precision_knn, recall_knn, thresholds_knn = precision_recall_curve(y_test, knn.predict_proba(x_test)[:,1])
plt.plot(precision_knn, recall_knn, label = "KNN")

precision_rfc, recall_rfc, thresholds_rfc = precision_recall_curve(y_test, rfc.predict_proba(x_test)[:,1])
plt.plot(precision_rfc, recall_rfc, label = "RFC")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc = "best")

**4.3.3. Validation Curve**

Now we take a closer look at the validation curve for the KNN method. With only a few neighbors the model overfits (the training score is far above the validation score), while toward the right, with many neighbors, we move into the underfitting area. The selected n_neighbors = 5 is therefore already close to the overfitting end.

In [None]:
#Validation Curve
from sklearn.model_selection import validation_curve
param_range = np.array([40, 30, 20, 15, 10, 8, 7, 6, 5, 4, 3, 2, 1])

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), 
    x,
    y,
    param_name = "n_neighbors",
    param_range=param_range)

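plt.plot(param_range, np.mean(train_scores, axis = 1), label = "Training score")
plt.plot(param_range, np.mean(test_scores, axis = 1), label = "Validation score")
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend(loc = "best")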
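
The same information can be read off numerically instead of from the plot: the gap between training and validation score per value of n_neighbors, and the value with the best validation score. Again a sketch on synthetic data, with `k_values` as an arbitrary illustrative range:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the price data
x_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
k_values = np.array([1, 3, 5, 10, 20, 40])

train_sc, test_sc = validation_curve(
    KNeighborsClassifier(), x_syn, y_syn,
    param_name="n_neighbors", param_range=k_values, cv=5,
)

mean_train = train_sc.mean(axis=1)
mean_test = test_sc.mean(axis=1)
# A large train/validation gap signals overfitting
gap = mean_train - mean_test
best_k = k_values[np.argmax(mean_test)]
print("gap per k:", np.round(gap, 3))
print("k with best validation accuracy:", best_k)
```

With n_neighbors = 1 every training point is its own nearest neighbor, so the training score is perfect and the gap is entirely due to overfitting.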

**4.3.4. Learning Curve**

We can also draw the learning curve. It shows whether the model could be improved with a larger dataset. However, the curves have largely flattened out, so a larger dataset would probably not bring much improvement in this case.

In [None]:
#Learning Curve
from sklearn.model_selection import learning_curve
from sklearn.utils import shuffle

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x, y = shuffle(x, y)

train_sizes_abs, train_scores, test_scores = learning_curve(LogisticRegression(solver="liblinear"), x, y)
plt.plot(train_sizes_abs, np.mean(train_scores, axis = 1))
plt.plot(train_sizes_abs, np.mean(test_scores, axis = 1))

In [None]:
x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x, y = shuffle(x, y)

train_sizes_abs, train_scores, test_scores = learning_curve(KNeighborsClassifier(n_neighbors=5), x, y)
plt.plot(train_sizes_abs, np.mean(train_scores, axis = 1))
plt.plot(train_sizes_abs, np.mean(test_scores, axis = 1))

In [None]:
x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x, y = shuffle(x, y)

train_sizes_abs, train_scores, test_scores = learning_curve(RandomForestClassifier(criterion = "entropy", n_estimators = 300), x, y)
plt.plot(train_sizes_abs, np.mean(train_scores, axis = 1))
plt.plot(train_sizes_abs, np.mean(test_scores, axis = 1))

**5. Further Actions**

Now we have already been able to make a quite good prediction. But how can we improve the model apart from feature engineering? 

**5.1. Hyperparameter Tuning**

One possibility would be to adjust the parameters of the model. We could use GridSearchCV to compare different parameters and select the best one. I will now do this using the KNN classifier as an example.

As you can see we get the best possible value for the desired parameter and also the score that matches this value. You can also see that this value, if we now think back to the curves, is in the area where no overfitting or underfitting occurs. 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x_train, x_test, y_train, y_test = train_test_split(x,y)

# Choose the type of classifier. 
clf = KNeighborsClassifier()

# Choose some parameter combinations to try
parameters = {'n_neighbors': [1,3,5,7,9,11,13,15], 
             }

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(x_train, y_train)

# Set the clf to the best combination of parameters
print(grid_obj.best_estimator_)
print(grid_obj.best_score_)
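
Besides `best_estimator_` and `best_score_`, the fitted search object also exposes `best_params_` and the full table of results in `cv_results_`, which makes it easy to see how close the other candidates were. A minimal sketch on synthetic data (the data and the shorter parameter list are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the price data
x_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    scoring="accuracy",
    cv=5,
)
search.fit(x_syn, y_syn)

# best_params_ holds the winning setting; cv_results_ holds the score of every candidate
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)[["param_n_neighbors", "mean_test_score"]]
print(results)
```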

**5.2. Compare Different Models**


In [None]:
models = ["Logistic Regression", "Random Forest Classifier", "KNeighbors Classifier"]
scores = []
predictions_list = []

x = df_price.drop(["Price"], axis=1)
y = df_price["Price"]

x_train, x_test, y_train, y_test = train_test_split(x, y)

lr = LogisticRegression(solver="liblinear")
rfc = RandomForestClassifier(criterion = "entropy", n_estimators = 300)
knn = KNeighborsClassifier(n_neighbors = 5)

Logistic_Regression = lr.fit(x_train,y_train)
RFC = rfc.fit(x_train, y_train)
KNN = knn.fit(x_train,y_train)

# Training accuracy of each model (same order as the models list)
scores.append(Logistic_Regression.score(x_train, y_train))
scores.append(RFC.score(x_train, y_train))
scores.append(KNN.score(x_train, y_train))

predictions_LR = Logistic_Regression.predict(x_test)
predictions_list.append(accuracy_score(y_test, predictions_LR))
predictions_RFC = RFC.predict(x_test)
predictions_list.append(accuracy_score(y_test, predictions_RFC))
predictions_KNN = KNN.predict(x_test)
predictions_list.append(accuracy_score(y_test, predictions_KNN))

In [None]:
sns.set_color_codes("muted")
sns.barplot(x=predictions_list, y=models, color="b")

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')
plt.show()
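
The comparison could also be written as a loop over a dictionary of models that collects training and test accuracy in one table. This is just one possible layout, sketched on synthetic data; `candidates` and `comparison` are hypothetical names, and the forest uses fewer trees here only to keep the example fast:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the price data
x_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_syn, y_syn, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNeighbors Classifier": KNeighborsClassifier(n_neighbors=5),
}

rows = []
for name, estimator in candidates.items():
    estimator.fit(x_tr, y_tr)
    rows.append({
        "model": name,
        "train_accuracy": estimator.score(x_tr, y_tr),
        "test_accuracy": estimator.score(x_te, y_te),
    })

# One row per model, sorted so the best test accuracy comes first
comparison = pd.DataFrame(rows).set_index("model")
print(comparison.sort_values("test_accuracy", ascending=False))
```

Seeing both columns side by side makes it easier to spot a model that only looks good on the training data.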