## About
_________

This notebook contains a very fast fundamental Linear Regression example in Python.
	
This work is part of a series of notebooks called Machine learning in minutes - very fast fundamental examples in Python. 
	
The approach is designed to help you grasp the applied machine learning lifecycle in minutes. It is not a substitute for taking the time to learn. Instead, it aims to help someone get started fast and gain an intuitive understanding of the typical steps early on.

## Step 0: Understand the problem
We want to predict the price of a diamond from a single feature of the diamond: its carat weight.
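
As a rough sketch of what the model will assume, simple linear regression with carat as the feature (the feature used later in this notebook) treats price as a straight-line function of carat plus an error term. The symbols $\beta_0$, $\beta_1$ and $\varepsilon$ are notation introduced here for illustration and do not appear in the original notebook:

$$
\text{price} = \beta_0 + \beta_1 \cdot \text{carat} + \varepsilon
$$

Fitting the model means estimating the intercept $\beta_0$ and the slope $\beta_1$ from the training data; scikit-learn's `LinearRegression` does this by ordinary least squares.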

## Step 1: Set-up and understand data
This step helps uncover issues that we will want to address in the next step and take into account when building and evaluating our model. We also want to find interesting relationships or patterns that we can possibly leverage in solving the problem we specified.

In [None]:
# Set-up libraries
import os
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
# Check data input source
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read-in data
df = pd.read_csv('../input/diamonds/diamonds.csv')

In [None]:
# Look at some information
df.info()

In [None]:
# Look at some records
df.head()

In [None]:
# Summarise data
df.describe()

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# Check for duplicate values
df.duplicated().sum()

In [None]:
# Explore data visually
sns.pairplot(df)

## Step 2: Preprocess data
This step typically takes the most time in the cycle but for our purposes, most of the datasets chosen in this series are clean. 
	
Real-world datasets are noisy and incomplete. The choices we make in this step to address data issues can impact downstream steps and the result itself. For example, it can be tricky to address missing data when we don't know why it's missing. Is it missing completely at random or not? It can also be tricky to address outliers if we do not understand the domain and problem context enough.
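
The diamonds data used here needs little cleaning, so the following is only a minimal sketch of two of the choices mentioned above, handling missing values and flagging outliers. The `toy` DataFrame and its values are made up for illustration, and the 1.5 * IQR rule is one common convention, not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the diamonds dataset
toy = pd.DataFrame({
    'carat': [0.3, 0.5, np.nan, 0.7, 1.0, 5.0],
    'price': [400, 900, 1200, np.nan, 5000, 18000],
})

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()

# Option A: drop every row that has at least one missing value
dropped = toy.dropna()

# Option B: fill missing values with each column's median
imputed = toy.fillna(toy.median())

# Flag potential outliers in carat with the 1.5 * IQR rule
q1, q3 = imputed['carat'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (imputed['carat'] < q1 - 1.5 * iqr) | (imputed['carat'] > q3 + 1.5 * iqr)

print(missing_counts)
print(imputed[is_outlier])
```

Which option is appropriate depends on why the data is missing and on domain knowledge about what counts as a plausible value.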

In [None]:
# Use carat as the single feature and price as the target,
# then split the dataset into 80% train and 20% validation
X = df['carat'].values.reshape(-1, 1)
y = df['price'].values.reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
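
The `reshape(-1, 1)` call is there because scikit-learn estimators expect the feature matrix to be two-dimensional, with one row per sample and one column per feature. A small sketch on made-up stand-in arrays (not the diamonds data) shows the resulting shapes after an 80/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in values for carat and price
carat = np.arange(10, dtype=float)
price = 100.0 * carat

# One column, as many rows as there are samples
X_demo = carat.reshape(-1, 1)
y_demo = price

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_demo.shape)          # 2D feature matrix
print(X_tr.shape, X_va.shape)
```

With 10 samples and `test_size=0.2`, 8 rows go to training and 2 to validation. Fixing `random_state` makes the split reproducible.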

## Step 3: Model and evaluate
This last step is three-fold.
	
We create the model and fit the model to the data we prepared for training.
	
We then proceed to making predictions with the data we prepared for validation.
	
Lastly, we evaluate the model. For the purposes of this exercise, we stop at the value returned by `model.score`, which for `LinearRegression` is the coefficient of determination (R2) rather than a classification-style accuracy. Other metrics we can use to evaluate a regression model include Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and Median Absolute Error.
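
As an illustration of the metrics listed above, here is a minimal sketch using `sklearn.metrics`. The `y_true` and `y_hat` arrays are made-up values, not output from the diamonds model:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical actual and predicted prices
y_true = np.array([300.0, 500.0, 1000.0, 2000.0])
y_hat = np.array([350.0, 450.0, 1100.0, 1800.0])

mae = mean_absolute_error(y_true, y_hat)      # average absolute error
mse = mean_squared_error(y_true, y_hat)       # average squared error
rmse = np.sqrt(mse)                           # same units as the target
medae = median_absolute_error(y_true, y_hat)  # robust to a few large errors
r2 = r2_score(y_true, y_hat)                  # share of variance explained

print('MAE: %.2f' % mae)
print('MSE: %.2f' % mse)
print('RMSE: %.2f' % rmse)
print('Median AE: %.2f' % medae)
print('R2: %.4f' % r2)
```

In the notebook itself, the same functions could be called with `y_val` and the model's predictions on `X_val`.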

In [None]:
# Build model
model = LinearRegression()
model.fit(X_train, y_train)
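
After fitting, the learned slope and intercept are available as `coef_` and `intercept_`. The sketch below uses synthetic data generated from a known line (y = 3x + 2, an assumption chosen for illustration) so the recovered parameters can be checked by eye:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that lies exactly on y = 3x + 2
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 3.0 * X_toy.ravel() + 2.0

toy_model = LinearRegression().fit(X_toy, y_toy)

slope = toy_model.coef_[0]        # one coefficient per feature
intercept = toy_model.intercept_  # value predicted when the feature is 0

print('slope: %.2f, intercept: %.2f' % (slope, intercept))
```

Because the notebook passes `y_train` as a two-dimensional array, its `model.coef_` would have shape `(1, 1)` and `model.intercept_` shape `(1,)`, so the values would need indexing accordingly.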

In [None]:
# Apply model to validation data
y_predict = model.predict(X_val)

actual_vs_predict = pd.DataFrame({'Actual': y_val.flatten(), 'Prediction': y_predict.flatten()})
actual_vs_predict.sample(12)

In [None]:
# Evaluate model: score() returns R2 (coefficient of determination), not accuracy
print('R2 on validation data: %.4f' % model.score(X_val, y_val))

## Learn more
If you found this example interesting, you may also want to check out the other works in the series.

* [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples)
	
Thanks for reading. Don't forget to upvote.