### Tasks
Analyze a dataset and, based on it:

* Process the data
* Format (if necessary)
* Train the model
* Predict December sales
* Create a histogram of the data
* Create a scatter plot

In [30]:
# Importing libs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [31]:
# Defining the dictionary to be used
dict_sales = {
    'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
    'sales': [2000, 2200, 2300, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300]
}

In [32]:
dict_sales

{'month': ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 'sales': [2000,
  2200,
  2300,
  2500,
  2600,
  2700,
  2800,
  2900,
  3000,
  3100,
  3200,
  3300]}

In [33]:
# Creating a DataFrame based on the dictionary
df_sales = pd.DataFrame.from_dict(dict_sales)

In [34]:
# Viewing the DataFrame
df_sales

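| | month | sales |
|---|---|---|
| 0 | January | 2000 |
| 1 | February | 2200 |
| 2 | March | 2300 |
| 3 | April | 2500 |
| 4 | May | 2600 |
| 5 | June | 2700 |
| 6 | July | 2800 |
| 7 | August | 2900 |
| 8 | September | 3000 |
| 9 | October | 3100 |
| 10 | November | 3200 |
| 11 | December | 3300 |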


In [35]:
# Checking the DataFrame structure
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   month   12 non-null     object
 1   sales   12 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 324.0+ bytes


In [36]:
# Adding a numerical column to represent the months
df_sales['month_number'] = range(1, 13)

In [37]:
# Viewing the updated DataFrame
df_sales

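| | month | sales | month_number |
|---|---|---|---|
| 0 | January | 2000 | 1 |
| 1 | February | 2200 | 2 |
| 2 | March | 2300 | 3 |
| 3 | April | 2500 | 4 |
| 4 | May | 2600 | 5 |
| 5 | June | 2700 | 6 |
| 6 | July | 2800 | 7 |
| 7 | August | 2900 | 8 |
| 8 | September | 3000 | 9 |
| 9 | October | 3100 | 10 |
| 10 | November | 3200 | 11 |
| 11 | December | 3300 | 12 |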


In [38]:
# Viewing the updated DataFrame structure
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   month         12 non-null     object
 1   sales         12 non-null     int64 
 2   month_number  12 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 420.0+ bytes


In [39]:
# Splitting the data into training and testing sets
X = pd.DataFrame(df_sales['month_number'])
y = pd.DataFrame(df_sales['sales'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Selecting a single column with `df_sales['<column_name>']` returns a `Series`, not a `DataFrame`, even though `df_sales` itself is a DataFrame. This is standard pandas behavior. Scikit-learn's Linear Regression model expects the feature matrix `X` to be a 2D array (such as a DataFrame, even one with a single column), while a Series is 1D. The target `y` may be 1D, but I kept it 2D here for consistency with `X`.

To correct this, I created `X` and `y` as DataFrames by wrapping the column selection in `pd.DataFrame` for both variables.
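
A minimal sketch of the 1D versus 2D distinction described above, using a small illustrative frame (`df_demo` and its values are assumptions, not the notebook's data). It also shows double-bracket selection, a common pandas alternative to wrapping the column in `pd.DataFrame`.

```python
import pandas as pd

# Illustrative frame with the same column names as df_sales (values are made up)
df_demo = pd.DataFrame({
    'month_number': [1, 2, 3],
    'sales': [2000, 2200, 2300],
})

# Single brackets select one column as a 1D Series
as_series = df_demo['month_number']

# Double brackets select a list of columns and keep the 2D DataFrame shape
as_frame = df_demo[['month_number']]

# Wrapping the Series in pd.DataFrame, as done in the cell above
wrapped = pd.DataFrame(df_demo['month_number'])

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
print(type(wrapped).__name__, wrapped.shape)      # DataFrame (3, 1)
```

Either 2D form can be passed to scikit-learn as `X`. The double-bracket version is usually the shorter way to write it.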

In [43]:
# Choosing and training a Linear Regression model
model = LinearRegression().fit(X_train, y_train)
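
As a hedged aside, the sketch below rebuilds the same data and split so it can run on its own, then inspects the fitted line. The inspection step is an addition for illustration, not part of the original cells. With 12 rows and `test_size=0.2`, scikit-learn rounds the test set up to 3 rows and leaves 9 for training. Because `y` is a single-column DataFrame, `coef_` comes back as a 2D array and `intercept_` as a 1D array.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same sales figures as dict_sales, keyed by month number
df_sales = pd.DataFrame({
    'month_number': range(1, 13),
    'sales': [2000, 2200, 2300, 2500, 2600, 2700,
              2800, 2900, 3000, 3100, 3200, 3300],
})

X = df_sales[['month_number']]  # 2D features
y = df_sales[['sales']]         # 2D target, matching the notebook

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Sizes of the two subsets
print(len(X_train), len(X_test))

# Slope and intercept of the fitted line; indexing unwraps the arrays
slope = model.coef_[0][0]
intercept = model.intercept_[0]
print(f"sales ~ {slope:.1f} * month_number + {intercept:.1f}")
```

The slope reads as the estimated sales increase per month, which gives a quick sanity check on the model before predicting with it.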