# Introduction

Over the past few years, data science has been rising in popularity. With that growth came automation of many of the methods that used to be hard work. 
Instead of writing out a full algorithm in C++, a data science practitioner now only has to call a wrapper to run a model. This has certainly made life easier for a lot of people over the last couple of years, me included. 

A popular saying goes along the lines of, _The important thing is not to stop questioning. Curiosity has its own reason for existence. One cannot help but be in awe when he contemplates the mysteries of eternity._

I was stuck with the same mindset. Calling `.fit` on a model ran perfectly, but how did it actually run? 

After a lot of careful pondering over that question, I turned to the trusty ol' internet and began studying how the algorithms actually work. 

This "from scratch" series is meant to implement the popular machine learning algorithms from scratch, based on their underlying mathematical principles. 


# Contents: 
1. [What is Linear Regression?](#1) 
    1. [ELI5: Linear Regression, as simple as possible](#2)
2. [Brief Data Outlook](#3)
    1. [Loading Libraries and Data](#4)
    2. [Understanding Data](#5)
3. [Exploratory Data Analysis](#6)
    1. [How much beer was consumed overall on weekend and weekdays?](#7) 
    2. [How much beer was consumed on an average for each day?](#8)
    3. [How much beer was consumed each month?](#9)
    4. [How was the temperature for the year?](#10)
    5. [How much beer was consumed each season?](#11)
    6. [How is the rainfall for the year?](#12)
    
4. [Preparing Data](#13)

## [What is Linear Regression?]()<a id="1"></a> <br>
#### [ELI5: Linear Regression, as simple as possible.]()<a id="2"></a> <br>

Let's consider that we have two sets of data. Let's try to be as silly as possible for this. 
- Height (X)
- Income (Y)

We assume that both variables are continuous, that is, they take numeric values on an ordered scale, such as:
- 100, 200, 300, ...

Since both height and income fit this pattern, they make a suitable dataset. 

Our aim, with linear regression, is to find the relationship between the two variables. Our entire approach revolves around these questions:
- Is income related to the height of the person? 
- Do taller people earn more? 
- If so, can we predict income from height? 

For a dataset of 1000 people, we would have data points such as: 
- 145 cm, 51000 USD
- 184 cm, 62000 USD
- 152 cm, 59000 USD

And so on. 

Linear regression lets us do some math on these data points to describe the relationship between the two variables. A closely related quantity is the correlation, which measures how strongly height relates to income.

Since correlation ranges from -1 to 1, its value can be read as follows: 
- A correlation of 1 would mean being tall always means you have a higher income.

- A correlation of 0 would mean there is no relationship between height and income.

- A negative value means a negative correlation, that is, taller people tend to have lower incomes. A correlation of -1 would mean taller people always make less money than shorter people.
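To make these three cases concrete, here is a minimal sketch of the Pearson correlation coefficient computed by hand with NumPy. The first three height/income pairs are the ones listed above; the remaining values and the helper name `pearson_r` are made up purely for illustration.

```python
import numpy as np

# Heights (cm) and incomes (USD). The first three pairs come from the example
# above; the last three are invented so there is a little more data to work with.
height = np.array([145.0, 184.0, 152.0, 170.0, 165.0, 178.0])
income = np.array([51000.0, 62000.0, 59000.0, 58000.0, 55000.0, 64000.0])


def pearson_r(x, y):
    """Pearson correlation: how the two variables vary together,
    scaled by how much each varies on its own."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


r = pearson_r(height, income)
print(f"Correlation between height and income: {r:.3f}")
```

Because of the scaling in the denominator, the result always lies between -1 and 1, whatever units height and income are measured in.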


This allows us to understand the relation between two variables, or two types of data. 
Also, once we have the relation, we can use linear regression for prediction. If we know the rough relationship between X and Y, then we can use this relationship to predict values of Y for a value of X we want.

For this example, that means predicting a person's income from their height.
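As a rough preview of what "using the relationship to predict" looks like, the sketch below fits a straight line with the closed-form least-squares formulas and predicts the income for one new height. The data beyond the three points listed earlier, the helper name `fit_line`, and the query height of 175 cm are all assumptions for illustration; the later parts of this series may well take a different route to the same line.

```python
import numpy as np

# Same illustrative data as before: three pairs from the text, three invented.
height = np.array([145.0, 184.0, 152.0, 170.0, 165.0, 178.0])
income = np.array([51000.0, 62000.0, 59000.0, 58000.0, 55000.0, 64000.0])


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(height, income)

# Predict Y (income) for a value of X (height) that is not in the data.
new_height = 175.0
predicted_income = slope * new_height + intercept
print(f"income = {slope:.2f} * height + {intercept:.2f}")
print(f"Predicted income at {new_height} cm: {predicted_income:.2f} USD")
```

The slope says how much income changes per extra centimetre of height in this toy data, and the intercept shifts the line so that it passes through the point of means.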


Before we head on to the math, let's go through the regular steps:
- Brief Data Outlook
- EDA
- Data Cleaning and Preprocessing
- Data Modelling
- Results and Evaluation

## [Brief Data Outlook]()<a id="3"></a> <br>

### [Loading Libraries and Data]()<a id="4"></a> <br>

In [None]:
#Basic needed libraries
import numpy as np
import pandas as pd

#visualization libraries
import plotly.express as px
import plotly.graph_objects as go


#Processing:
import calendar

In [None]:
# The CSV uses commas as decimal separators; pandas handles this at load time via the `decimal` argument.
df = pd.read_csv("../input/beer-consumption-sao-paulo/Consumo_cerveja.csv", decimal=",")


### [Understanding Data]()<a id="5"></a> <br>

In [None]:
#Finding Number of rows and columns
print("Dataset contains {} rows and {} columns".format(df.shape[0], df.shape[1]))

Let's begin the journey. The dataset provides a handful of features, but the column names are all in a different language (Portuguese). Being only familiar with English, my first step is to translate them into English and then proceed. 



In [None]:
df.head()

In [None]:
df.columns = ['Date', 'Temp_Median', 'Temp_Min', 'Temp_Max', 'Precipitation', 'Weekend', 'Consumption_Litres']

In [None]:
df.head()

#### Columns Description

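- **Date**: The date. The data is recorded daily for the year 2015. 
- **Temp_Median**: Average temperature for the day. Note that the original Portuguese column is "Temperatura Media", and "média" means mean, so despite the name chosen here this is most likely the daily mean rather than the median.
- **Temp_Min**: Minimum temperature for that day. 
- **Temp_Max**: Maximum temperature for that day.
- **Precipitation**: Amount of rainfall for that day (the original column is labelled in mm). 
- **Weekend**: Boolean flag, 1: Weekend, 0: Weekday. 
- **Consumption_Litres**: The amount of beer consumed that day, in litres.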

#### Converting features to numeric values

Checking for null values

In [None]:
df.isnull().sum()

In [None]:
# Drop the trailing rows: after the year ends, the file contains only null values
df = df.dropna()

In [None]:
df['Consumption_Litres'] = pd.to_numeric(df['Consumption_Litres'])

## [Exploratory Data Analysis]()<a id="6"></a> <br>

#### [How much beer was consumed overall on weekend and weekdays?]()<a id="7"></a> <br>

The chart below shows that, in total, more beer was consumed on weekdays than on weekends. However, this is only an overall aggregate, and there are fewer weekend days than weekdays in a year, so the two totals are not directly comparable. Next, let's chart consumption per day of the week to see how much beer was consumed on average on each day. 

In [None]:
# Total consumption on weekdays vs. weekends
weekdays = df.loc[df.Weekend == 0, 'Consumption_Litres'].sum()
weekend = df.loc[df.Weekend == 1, 'Consumption_Litres'].sum()

labels = ['Weekdays', 'Weekend']
values = [weekdays, weekend]

# A single colour string is applied to every bar
fig = go.Figure(data=[go.Bar(
    x=labels, y=values, marker_color='crimson'
)])
fig.show()

#### [How much beer was consumed on an average for each day?]()<a id="8"></a> <br>

First, we will need to add the days into the dataframe. While we are doing this, let's also add the month of the year too. 

In [None]:
df['Date'] = pd.to_datetime(df['Date'])


In [None]:
# Extract the month and weekday names from the datetime column
df['Months'] = df['Date'].dt.month_name()
df['Day'] = df['Date'].dt.day_name()

In [None]:
df.head()

The dataframe now has `Months` and `Day` columns for each date in 2015. 

In [None]:
fig = px.box(df, x="Day", y="Consumption_Litres", color="Day", orientation='v', notched=True, title = 'Beer Consumption by day of the week' )

fig.show()

We can observe that consumption was highest on Saturdays, followed by Sundays. This leads to the conclusion that, on average, more beer was consumed per day on weekends than on weekdays. 
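The difference between comparing totals and comparing per-day averages can be shown on a tiny example. The sketch below uses the column names from this notebook, but the seven rows of numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# One invented week: five weekdays (Weekend == 0) and two weekend days (Weekend == 1).
toy = pd.DataFrame({
    "Weekend": [0, 0, 0, 0, 0, 1, 1],
    "Consumption_Litres": [20.0, 22.0, 21.0, 23.0, 24.0, 30.0, 32.0],
})

# Total, per-day mean, and number of days for each group.
summary = toy.groupby("Weekend")["Consumption_Litres"].agg(["sum", "mean", "count"])
print(summary)
```

In this toy week the weekday total is larger simply because there are five weekdays and only two weekend days, while the per-day mean is larger for the weekend. Running the same `groupby` on `df` would give the corresponding figures for the real data.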

#### [How much beer was consumed each month?]()<a id="9"></a> <br>

In [None]:
fig = px.box(df, x="Months", y="Consumption_Litres", color="Months", orientation='v', notched=True, title = 'Beer Consumption by Months of the year' )

fig.show()

We can observe that beer consumption was higher around the turn of the year, from December to March, and dipped mid-year, in June and July. 
Since the seasons in the Southern Hemisphere fall at different times of the year than in the Northern Hemisphere, the temperature should be checked. 


#### [How was the temperature for the year?]()<a id="10"></a> <br>

In [None]:


fig = go.Figure()
# Create and style traces
fig.add_trace(go.Scatter(x=df['Date'], y=df['Temp_Min'], name='Minimum Temperature',
                         line=dict(color='royalblue', width=3)))
fig.add_trace(go.Scatter(x=df['Date'], y=df['Temp_Max'], name = 'Max Temperature',
                         line=dict(color='crimson', width=3)))
fig.add_trace(go.Scatter(x=df['Date'], y=df['Temp_Median'], name='Median Temperature',
                         line=dict(color='orange', width=3,
                              dash='dash') 
))


# Edit the layout
fig.update_layout(title='Temperature in São Paulo',
                   xaxis_title='Dates',
                   yaxis_title='Temperature (degrees C)')


fig.show()

From this, we can observe that the seasons are indeed reversed relative to the Northern Hemisphere. 

According to Frommer's (https://www.frommers.com/destinations/brazil/planning-a-trip/when-to-go), the seasons in Brazil are:

- Summer: December to March
- Autumn: April to May
- Winter: June to September
- Spring: October to November

Using that information, we can assign seasons to the dataset. 

This idea comes from the graph in this notebook: kaggle.com/michau96/when-brazilian-students-drink-beer/data


In [None]:
seasons_map = {
    'January': 'Summer',
    'February': 'Summer',
    'March': 'Summer',
    'April': 'Autumn',
    'May': 'Autumn',
    'June': 'Winter',
    'July': 'Winter',
    'August': 'Winter',
    'September': 'Winter',
    'October': 'Spring',
    'November': 'Spring',
    'December': 'Summer'
    
}


df['Season'] = df['Months'].map(seasons_map)

#### [How much beer was consumed each season?]()<a id="11"></a> <br>

In [None]:
fig = px.box(df, x="Season", y="Consumption_Litres", color="Season", orientation='v', notched=True, title = 'Beer Consumption by Seasons of the year' )

fig.show()

We can observe that beer consumption is highest during Summer and Spring, followed by Winter. This is surprising, as I had expected Winter to have the lowest beer consumption. 

#### [How is the rainfall for the year?]()<a id="12"></a> <br>

In [None]:


fig = go.Figure()
# Create and style traces
fig.add_trace(go.Scatter(x=df['Date'], y=df['Precipitation'], name='Rainfall for the year',
                         line=dict(color='blue', width=3)))



# Edit the layout
fig.update_layout(title='Rainfall in São Paulo',
                   xaxis_title='Dates',
                   yaxis_title='Precipitation')


fig.show()

Rainfall does not seem to have had a large effect on beer consumption. Rainfall is highest during the summer months, which is also when consumption is high, but since beer is usually drunk to beat the heat, this is likely a case where correlation does not imply causation. 
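One way to check an impression like this numerically is to look at how each weather column correlates with consumption. The sketch below shows the pandas call on a small, made-up frame that reuses this notebook's column names; the numbers are purely illustrative, so any conclusion should only be drawn after running the same call on `df`.

```python
import pandas as pd

# Six invented days; only the column names match the real dataset.
toy = pd.DataFrame({
    "Temp_Max": [22.0, 25.0, 28.0, 31.0, 34.0, 27.0],
    "Precipitation": [0.0, 12.0, 0.0, 3.0, 0.0, 40.0],
    "Consumption_Litres": [20.5, 22.0, 25.5, 28.0, 31.5, 23.0],
})

# Pearson correlation of every column with consumption, excluding consumption itself.
corr_with_consumption = toy.corr()["Consumption_Litres"].drop("Consumption_Litres")
print(corr_with_consumption.sort_values(ascending=False))
```

A value near 0 for `Precipitation` on the real data would support the reading above, while a clearly positive value for temperature would be consistent with heat, rather than rain, being linked to consumption. Either way, this measures association only and says nothing about cause.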

Having gone through the basic EDA, let's head into the crux of this notebook: Linear regression from scratch.

## [Preparing Data]()<a id="13"></a> <br>

# To be Continued. This part to be out by 12th Aug 2021. 