#### This Kernel Follows Part 6 of sentdex's [Data Analysis with Pandas](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfSfqQuee6K8opKtZsh7sA9)

In [None]:
import pandas as pd

df = pd.read_csv('../input/diamonds.csv', index_col=0)
df.head()

The task is to predict the price of a diamond from all the other data about it in the dataset. Since the prediction is a continuous value, like 326 dollars for Diamond \#1, this is a regression problem. We will build a linear regressor as our model using the scikit-learn library.


A cursory look at the columns suggests that all of them matter in determining the price of a diamond, so we will use every column except the price as our features, with price as our target.

Some of our features, like "cut" and "color", take a value from a finite set of possible values. Our linear regressor can only take numerical values, so we need to turn these into numbers. We could simply assign a number to each of the possible values (a vocabulary), like so:

In [None]:
df['cut'].astype("category").cat.codes[:200]

The above takes "cut" and assigns an integer code to each possible value. pandas sorts the distinct labels (alphabetically for text) and uses each label's position in that sorted list as its code.
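
A quick way to see how the codes are assigned is to try it on a toy Series. This is only a sketch on made-up values (the name `cuts` and the labels are placeholders, not the real dataset); it prints the inferred categories next to the codes.

```python
import pandas as pd

# Toy Series (not the real dataset) with labels in a deliberately unsorted order.
cuts = pd.Series(["Ideal", "Premium", "Good", "Ideal", "Fair"])
as_category = cuts.astype("category")

# Categories are the sorted unique labels; each code is an index into that list.
categories = list(as_category.cat.categories)
codes = as_category.cat.codes.tolist()

print(categories)
print(codes)
```

Here "Fair" gets the lowest code simply because it sorts first, which happens to be right, but "Ideal" lands below "Premium" purely because of spelling.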

By doing this, we lose the relative weight of each value in the cut's vocabulary: a "Premium" cut is definitely more valuable than a "Good" cut. So we need to associate each value with a number that reflects its semantic weight.

To achieve this, we create a dictionary for each of the text-based features.

In [None]:
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}

The number assigned to each text value in these dictionaries comes from the dataset's descriptive text, which explains the relative ranking of the values in each vocabulary.
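
Before applying such a dictionary to the whole dataframe, it can help to try `Series.map` on a toy Series. The sketch below uses made-up labels (`cut_order` and the "Unknown" label are placeholders for illustration) and shows that any label missing from the dictionary silently becomes NaN, which is worth checking for before training.

```python
import pandas as pd

# Hypothetical toy data; "Unknown" is deliberately missing from the dictionary.
cut_order = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
cuts = pd.Series(["Ideal", "Good", "Unknown", "Fair"])

mapped = cuts.map(cut_order)

# Labels without a dictionary entry become NaN, so count them before training.
missing = int(mapped.isna().sum())
print(mapped.tolist())
print(f"unmapped labels: {missing}")
```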

In [None]:
# Mapping using these dictionaries in the dataframe
df['cut'] = df['cut'].map(cut_class_dict)
df['color'] = df['color'].map(color_dict)
df['clarity'] = df['clarity'].map(clarity_dict)

df.head()

In [None]:
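import sklearn
import sklearn.linear_model  # submodules used below are imported explicitly
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.utils
from sklearn import svm

# Shuffle the dataframe using sklearn; you could also use the pandas reindex method with np.random
df = sklearn.utils.shuffle(df)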

In [None]:
# X is the feature set - a list of list of features
X = df.drop('price', axis=1).values
# y is the target - a list of prices
y = df['price'].values

### Scaling the values in the feature set to a small range

Models like linear regressors are basically doing linear algebra over and over again, and the simpler you make the data by reducing its spread, the faster they can converge. We can scale every column so that its values fall in a small range such as 0 to 1. There are many formulas for this:
* `scaled_value = (value - min) / (max - min)`
* `scaled_value_z_score = (value - mean) / standard_deviation`

and many others.

Here we will let sklearn's preprocessing module do the magic for us. `sklearn.preprocessing.scale` standardizes each column to zero mean and unit variance. We are not too worried about the exact range of values after scaling (it can be 0 to 1, -1 to 1 or -3 to 3); we are good as long as it is sufficiently small.
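
To make the z-score idea concrete, here is a minimal sketch in plain NumPy on a made-up matrix (`X_toy` and its values are placeholders). With default settings, `sklearn.preprocessing.scale` should give an equivalent result, since it also centers each column and divides by the population standard deviation.

```python
import numpy as np

# Toy feature matrix: two columns on very different scales (made-up values).
X_toy = np.array([[0.23, 55.0], [0.31, 61.0], [0.29, 58.0], [1.20, 65.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```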


In [None]:
X = sklearn.preprocessing.scale(X)

### Train and test split of the data rows

To validate the model we hold out some data rows (the test data) and do not use them to train the model. By evaluating the model's predictions on data it has not seen during training, we get an estimate of how well it generalizes: whether, once deployed to production, it can take in new data and give good predictions on data that comes in tomorrow.

In [None]:
# Video uses a manual method, but sklearn gives a nice method to do this declaratively
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3, random_state=101)

### Defining and training the model using train data

We use a Support Vector Regressor (SVR), an adaptation of the Support Vector Machine, which is best known as a classifier, to regression problems.

Sklearn is a beauty when it comes to the API for defining and training models.

In [None]:
%%time
# Support Vector Regression with Linear Kernel
model = svm.SVR(kernel='linear')
model.fit(X_train, y_train)

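Now let us put the trained model on a pedestal and ask it to predict on our test data and evaluate the score. For sklearn regressors, `score` returns the coefficient of determination (R²): the higher the score, the better the predictions match the actual values in the test set, with 1.0 being a perfect match.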
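
For sklearn regressors, `score` reports the coefficient of determination, R². As a rough sketch of what that number means, the helper below (`r2_score_manual` is a hypothetical name, and the prices are made up) computes it by hand in plain Python: 1 minus the ratio of the squared prediction errors to the squared deviations from the mean.

```python
def r2_score_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up prices, purely for illustration.
actual = [326.0, 400.0, 550.0, 700.0]

perfect = r2_score_manual(actual, actual)
# Always predicting the mean price gives a score of 0.
baseline = r2_score_manual(actual, [sum(actual) / len(actual)] * len(actual))

print(perfect, baseline)
```

A model that does worse than always predicting the mean gets a negative score, so the value is not bounded below by 0.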

In [None]:
model.score(X_test, y_test)

Prodding further - printing out a few predicted and actual values for the test set 

In [None]:
# Predict the first 50 test rows in one call, without overwriting X and y
for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

For some cases it is predicting negative prices! Not so useful in those cases.

#### Generally there are two approaches from here
* Train a few different models, compare and contrast
* Tweak the model's knobs to perform better

We will take the first route: train a few models and evaluate them.
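
One compact way to compare several models is to fit them in a loop on the same split. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_regression` as a stand-in for the diamonds features, and the names `candidates` and `scores` are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the diamonds data, so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

candidates = {
    "SVR (linear)": SVR(kernel="linear"),
    "SVR (rbf)": SVR(kernel="rbf"),
    "SGDRegressor": SGDRegressor(max_iter=10_000, random_state=0),
    "LinearRegression": LinearRegression(),
}

# Fit every candidate on the same split and record its R^2 on the held-out rows.
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = candidate.score(X_te, y_te)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The cells that follow train the same kinds of models one at a time on the real data, which also makes it easy to time each one with `%%time`.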

In [None]:
%%time
# Support Vector Regression with RBF Kernel
model = svm.SVR(kernel='rbf')
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

The RBF kernel seems to have done really badly, with a score of 0.51, but hey, it does not have the negative-price problem that the linear kernel had.

In [None]:
%%time
# Using SGD Regressor
model = sklearn.linear_model.SGDRegressor(max_iter=10_000)
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

In [None]:
%%time
# Using Linear Regression
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

The SGD Regressor seems to have done quite well, with a score of 0.9, but there are some negative predictions here too!

Ensembling these models together may give a better score, and it is a common technique in production systems. An ensemble could throw out negative predictions from its constituent models and take the prediction from another, non-negative model, thus smoothing out the aberrations.
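
As a rough sketch of that idea, the helper below averages, for each sample, only the predictions that are non-negative. Everything here is an assumption for illustration: `combine_non_negative` is a hypothetical helper, the numbers are made up, and falling back to 0 when every model predicts a negative price is just one possible choice.

```python
import numpy as np

def combine_non_negative(predictions):
    """Average the non-negative predictions per sample (hypothetical helper)."""
    predictions = np.asarray(predictions, dtype=float)
    valid = predictions >= 0                      # True where a model is usable
    counts = valid.sum(axis=0)                    # usable models per sample
    totals = np.where(valid, predictions, 0.0).sum(axis=0)
    # If every model predicted a negative price, fall back to 0 (an assumption).
    return np.where(counts > 0, totals / np.maximum(counts, 1), 0.0)

# Rows are models, columns are samples; the numbers are made up.
stacked = [
    [350.0, -120.0, 900.0],
    [330.0,   80.0, -50.0],
]
combined = combine_non_negative(stacked)
print(combined)
```

With real models, `stacked` would hold one row of `model.predict(X_test)` output per trained model.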

---

Some people in the video's comments section suggested using "dummies" instead of assigning our own scale values to the text-based columns. With our own scale we may be misrepresenting the size of the differences: how different is "Premium" from "Fair"? Is the price affected along the linear scale we have assumed, or along a different one, say an exponential difference between "Premium" and "Fair"?

To sidestep this entire debate, we can use a representation called ["dummies" in our dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), also known as one-hot encoding. For each possible value of the "cut" or "color" columns we create a new column in our dataframe; if the cut is "Ideal", then the column "cut_Ideal" is set to 1 and all the other "cut" columns are set to 0.
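
On a tiny made-up frame the effect is easy to see. This is only a sketch (the frame `toy` and its values are placeholders): the numeric column passes through untouched and each distinct "cut" label becomes its own indicator column.

```python
import pandas as pd

# Tiny made-up frame with one text column and one numeric column.
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Ideal"], "price": [326, 337, 340]})

# Numeric columns pass through; each distinct "cut" label becomes its own column.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded)
```

Depending on the pandas version, the indicator columns may show up as 0/1 integers or as True/False booleans.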

In [None]:
df = pd.read_csv('../input/diamonds.csv', index_col=0)
dummies_df = pd.get_dummies(df)
dummies_df.head()

In [None]:
# X is the feature set - a list of list of features
X = dummies_df.drop('price', axis=1)
# y is the target - a list of prices
y = dummies_df['price'].values

# Scale our features
X[['depth', 'carat', 'table', 'x', 'y', 'z']] = sklearn.preprocessing.scale(X[['depth', 'carat', 'table', 'x', 'y', 'z']])
X = X.values

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3, random_state=101)

#### Training all three models, and checking what improvement we can get out of it

In [None]:
%%time
# Support Vector Regression with Linear Kernel
model = svm.SVR(kernel='linear')
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

In [None]:
%%time
# Support Vector Regression with RBF Kernel
model = svm.SVR(kernel='rbf')
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

In [None]:
%%time
# Using SGD Regressor
model = sklearn.linear_model.SGDRegressor(max_iter=10_000)
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

There seems to be no significant improvement, only a small increase. Let us fall back to LinearRegression once more.

In [None]:
%%time
# Using Linear Regression
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)

In [None]:
print(f"--- Score: {model.score(X_test, y_test)} ---")

for predicted, actual in zip(model.predict(X_test[:50]), y_test[:50]):
    print(f"Predicted: {predicted}, Actual: {actual}")

A simple linear regressor by itself gives a score of 0.91 while taking far less time to train. Always lean towards simpler models before jumping to fancier methods; that is very apparent here when comparing support vector machines with simple linear regression.

To buy some more score beyond 0.91, we may revisit this later and:
* give the model more training data (the current split is 70-30)
* tune hyperparameters
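
For the hyperparameter route, one option is sklearn's `GridSearchCV`, which tries every combination in a grid using cross-validation. The sketch below is an assumption-laden illustration, not a tuned setup: it runs on synthetic data from `make_regression`, and the grid (only `C`, with three arbitrary values) is a placeholder.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data standing in for the scaled diamond features.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid: only the regularization strength C is varied here.
param_grid = {"C": [1, 10, 100]}
search = GridSearchCV(SVR(kernel="linear"), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

On the real dataset the search would be fitted on `X_train` and `y_train`, keeping the test rows aside for the final evaluation.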