### Tasks
Analyze a database and, based on it:

* Process the data
* Format (if necessary)
* Train the model
* Predict December sales
* Create a histogram of the data
* Create a scatter plot

In [2]:
# Importing libs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
# Defining the dictionary with the monthly sales data
dict_sales = {
    'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
    'sales': [2000, 2200, 2300, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300]
}

In [4]:
dict_sales

{'month': ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 'sales': [2000,
  2200,
  2300,
  2500,
  2600,
  2700,
  2800,
  2900,
  3000,
  3100,
  3200,
  3300]}

In [5]:
# Creating a DataFrame based on the dictionary
df_sales = pd.DataFrame.from_dict(dict_sales)

In [6]:
# Viewing the DataFrame
df_sales

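| | month | sales |
|---|---|---|
| 0 | January | 2000 |
| 1 | February | 2200 |
| 2 | March | 2300 |
| 3 | April | 2500 |
| 4 | May | 2600 |
| 5 | June | 2700 |
| 6 | July | 2800 |
| 7 | August | 2900 |
| 8 | September | 3000 |
| 9 | October | 3100 |
| 10 | November | 3200 |
| 11 | December | 3300 |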


In [7]:
# Checking the DataFrame structure
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   month   12 non-null     object
 1   sales   12 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 324.0+ bytes


In [8]:
# Adding a numerical column to represent the months
df_sales['month_number'] = range(1, 13)
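
The `range(1, 13)` assignment above relies on the rows already being in calendar order. As a hedged alternative, the month numbers could instead be looked up from the month names with the standard library's `calendar` module. This is only a sketch: the names `month_to_number` and `month_numbers` are introduced here for illustration, and the lookup assumes English month names (`calendar.month_name` follows the current locale).

```python
import calendar

# calendar.month_name[0] is an empty string, so it is skipped.
# Assumption: the locale yields English names such as 'January'.
month_to_number = {
    name: number
    for number, name in enumerate(calendar.month_name)
    if name
}

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Look each name up instead of trusting the row order
month_numbers = [month_to_number[month] for month in months]
print(month_numbers)
```

With a DataFrame, the same dictionary could be passed to `df_sales['month'].map(...)`.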

In [9]:
# Viewing the updated DataFrame
df_sales

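| | month | sales | month_number |
|---|---|---|---|
| 0 | January | 2000 | 1 |
| 1 | February | 2200 | 2 |
| 2 | March | 2300 | 3 |
| 3 | April | 2500 | 4 |
| 4 | May | 2600 | 5 |
| 5 | June | 2700 | 6 |
| 6 | July | 2800 | 7 |
| 7 | August | 2900 | 8 |
| 8 | September | 3000 | 9 |
| 9 | October | 3100 | 10 |
| 10 | November | 3200 | 11 |
| 11 | December | 3300 | 12 |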


In [10]:
# Viewing the updated DataFrame structure
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   month         12 non-null     object
 1   sales         12 non-null     int64 
 2   month_number  12 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 420.0+ bytes


### Defining `X` and `y` variables
By convention, we split our data into **independent variables** (or **features**) and a **dependent variable** (or **target variable**). The independent variables are the inputs the model learns from, and the dependent variable is what we aim to predict.

#### **X (independent variable)**
`X` represents the **features** (or **independent variables**) that we provide to the model to make predictions. In practical terms, `X` is a matrix containing the input data, where each row corresponds to a sample, and each column corresponds to a feature or characteristic of that sample.

#### **y (dependent variable / target variable)**
`y` represents the **target variable** (or **dependent variable**) that the model tries to predict based on the input data in `X`. In classification problems, `y` would be the class (or category) that we want to predict. In regression problems, `y` would be the numerical value that we want to estimate.

In [11]:
# Setting `X` and `y` variables

# X - the feature(s) we'll provide to the model for making predictions (in this case, the month numbers)
# y - the target variable we want the model to predict (in this case, the sales)
# In other words, we want to predict the sales for a given month based on the month number
X = pd.DataFrame(df_sales['month_number'])
y = pd.DataFrame(df_sales['sales'])

# Splitting the data into training and testing sets

# We need to divide the data into two sets: one for training the model and another for testing it
# We ask for 80% of the data for training; with only 12 rows this works out to 9 training rows and 3 test rows (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

Selecting a single column with `df_sales['<column_name>']` returns a `Series` (1D), not a `DataFrame`, even though `df_sales` itself is a DataFrame. Scikit-learn's Linear Regression model expects `X` to be a 2D array with one row per sample and one column per feature (a single-column DataFrame qualifies), while a Series is 1D. `y` may be 1D, but keeping it 2D as well is what gives the 2D arrays in the prediction and coefficient outputs further down.

To handle this, I made sure `X` and `y` were created as DataFrames by wrapping the column selection in `pd.DataFrame` for both variables. Selecting with a list of column names, as in `df_sales[['month_number']]`, gives the same result.
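
The difference between the two selection styles can be seen on a small frame. This is a minimal sketch with a three-row stand-in DataFrame (`df`, `as_series`, `as_frame` and `wrapped` are names made up for this illustration, not part of the notebook):

```python
import pandas as pd

# Small stand-in for df_sales (hypothetical, three rows only)
df = pd.DataFrame({'month_number': [1, 2, 3], 'sales': [2000, 2200, 2300]})

as_series = df['month_number']             # single brackets: Series, 1D
as_frame = df[['month_number']]            # list of column names: DataFrame, 2D
wrapped = pd.DataFrame(df['month_number']) # the approach used in this notebook

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(wrapped).__name__, wrapped.shape)
```

Both `as_frame` and `wrapped` are single-column DataFrames, so either form gives scikit-learn the 2D input it needs for `X`.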

In [12]:
# Checking the subsets length
print(f'X_train length: {len(X_train)}\nX_test length: {len(X_test)}\ny_train length: {len(y_train)}\ny_test length: {len(y_test)}')

X_train length: 9
X_test length: 3
y_train length: 9
y_test length: 3
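
The 9 / 3 split can be explained with a little arithmetic. The sketch below assumes, as I understand scikit-learn's behaviour, that a fractional `train_size` is rounded down to a whole number of rows and that the test set receives whatever is left; the variable names are illustrative only.

```python
import math

n_samples = 12    # rows in df_sales
train_size = 0.8  # fraction passed to train_test_split

# 0.8 * 12 = 9.6 rows, which cannot exist, so it is rounded down
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)  # 9 3
```

So the effective split here is 75% / 25% rather than 80% / 20%.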


In [13]:
# Choosing a Linear Regression model
model = LinearRegression()

# Training the chosen model on the training subsets to fit the relationship between `X` and `y`
model.fit(X_train, y_train)

### Testing the new Linear Regression model

We have created the training and testing subsets. The model has seen the **training subsets** and learned the relationship between `X_train` (the features) and `y_train` (the targets). Now we want to see whether, based on this knowledge, it can accurately predict values for a subset it hasn't seen yet (`X_test`).

By convention, the predictions made by the model will be called `y_pred`, and we will compare these predictions with the actual values (`y_test`) to assess how well the model performed.

In [14]:
# Generating predictions for the test subset
y_pred = model.predict(X_test)

In [15]:
# Viewing predictions
y_pred

array([[3222.48603352],
       [3113.12849162],
       [2128.91061453]])

In [16]:
# Viewing our real results
y_test

| | sales |
|---|---|
| 10 | 3200 |
| 9 | 3100 |
| 0 | 2000 |
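
To put a number on how close `y_pred` is to `y_test`, the mean squared error and the R² score can be worked out by hand. The sketch below copies the values from the two outputs above into plain lists (`y_pred_values` and `y_test_values` are names introduced here); the `mean_squared_error` and `r2_score` functions imported at the top of the notebook compute the same two quantities.

```python
# Values copied from the y_pred and y_test outputs above
y_pred_values = [3222.48603352, 3113.12849162, 2128.91061453]
y_test_values = [3200, 3100, 2000]

n = len(y_test_values)
residuals = [actual - predicted
             for actual, predicted in zip(y_test_values, y_pred_values)]

# Mean squared error: average of the squared residuals
ss_res = sum(r ** 2 for r in residuals)
mse = ss_res / n
rmse = mse ** 0.5

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
mean_actual = sum(y_test_values) / n
ss_tot = sum((actual - mean_actual) ** 2 for actual in y_test_values)
r2 = 1 - ss_res / ss_tot

print(f'MSE: {mse:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.4f}')
```

With only three test rows these scores are a rough indication at best. Most of the error comes from January, where the prediction is off by about 129, while the other two predictions are within 25 of the actual sales.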


### Seeing the generated coefficients for this model
The **straight line equation** for a Linear Regression is:

$y = ax + b$

Where $a$ is the **angular coefficient** (or **slope**) of the equation, and $b$ is the **linear coefficient** (or **intercept**) of the equation.

There can be many values for $a$, depending on the number of **independent variables (features)** in our model. Each independent variable will have its own $a$ (**slope**), representing its individual impact on the predicted value of `y`.

However, there will be only a single value for $b$ (**intercept**), which represents the value of `y` when all independent variables are equal to zero.
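
As a reference, with $n$ features the equation above generalizes to the form below. The subscripted symbols $a_i$ and $x_i$ are notation introduced here for illustration; they do not appear elsewhere in this notebook.

$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$

Setting $x_1 = x_2 = \dots = x_n = 0$ leaves $y = b$, which is why $b$ is the intercept. With $n = 1$ the equation reduces to $y = ax + b$.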

This is a Simple Linear Regression, with only one independent variable (or feature), so we'll have only a single value for $a$.

In [17]:
# Showing the values of `a` for each `x`
model.coef_

array([[109.3575419]])

The value of $a$ (the **angular coefficient** or **slope** of this equation) is `109.3575419`.

In [18]:
# Showing the value of `b`
model.intercept_

array([2019.55307263])

The value of $b$ (the **linear coefficient** or **intercept** of this equation) is `2019.55307263`.
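
Both coefficients can be cross-checked with the closed-form least-squares formulas for a single feature: $a$ is the covariance of `x` and `y` divided by the variance of `x`, and $b$ is the mean of `y` minus $a$ times the mean of `x`. The sketch below assumes the training rows are every row except 0, 9 and 10, which is inferred from the index of the `y_test` output above; `test_rows` and `train` are names made up for this check.

```python
month_numbers = list(range(1, 13))
sales = [2000, 2200, 2300, 2500, 2600, 2700,
         2800, 2900, 3000, 3100, 3200, 3300]

# Assumption: rows 0, 9 and 10 were held out (taken from the y_test index)
test_rows = {0, 9, 10}
train = [(x, y)
         for i, (x, y) in enumerate(zip(month_numbers, sales))
         if i not in test_rows]

n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n

# Slope: sum of cross-deviations divided by sum of squared x-deviations
a = (sum((x - mean_x) * (y - mean_y) for x, y in train)
     / sum((x - mean_x) ** 2 for x, _ in train))
# Intercept: the fitted line passes through (mean_x, mean_y)
b = mean_y - a * mean_x

print(f'a = {a:.7f}, b = {b:.8f}')

# Plugging the held-out month numbers into y = ax + b
for i in (10, 9, 0):
    x = month_numbers[i]
    print(x, round(a * x + b, 8))
```

The printed values should agree with `model.coef_`, `model.intercept_` and `y_pred` up to floating-point rounding.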