# INTRODUCTION

Regression analysis is a statistical technique for predicting a continuous variable from one or more independent variables. The variable being predicted is the dependent variable, and the variables used to predict it are the independent variables.

In this machine learning regression project, the goal is to develop a model that accurately predicts the dependent variable from the values of the independent variables. The model is trained on a dataset of historical data, from which the algorithm learns patterns it can use to make predictions.

Once the model is trained, it can be used to predict the value of the dependent variable for new data points. These predictions can inform decisions about future outcomes, such as predicting sales, forecasting demand, or assessing risk.
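As a minimal illustration of the train-then-predict workflow described above, the sketch below fits a straight line to a small toy dataset with ordinary least squares and then predicts a new point. The data values and variable names are invented for illustration and have nothing to do with the Favorita data used later.

```python
import numpy as np

# Hypothetical toy data: one independent variable (x) and one dependent variable (y).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # generated from y = 2x + 1

# "Training": estimate slope and intercept by ordinary least squares.
design = np.column_stack([x_train, np.ones_like(x_train)])
(slope, intercept), *_ = np.linalg.lstsq(design, y_train, rcond=None)

# "Prediction": apply the fitted line to a value not seen during training.
x_new = 6.0
y_pred = slope * x_new + intercept
print(round(float(y_pred), 6))
```

Real projects usually have many independent variables and noisy data, so the fit will not be exact as it is here.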

# BUSINESS UNDERSTANDING

This is a time series forecasting problem. In this project, we'll predict store sales using data from **Corporation Favorita**, a large Ecuador-based grocery retailer.

Specifically, we are to build a model that accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The training data includes dates, store and product information, whether the item was being promoted, and the sales numbers. Additional files include supplementary information that may be useful in building the models.

# IMPORTATION

In [7]:
# Import necessary libraries

# Connect to server
import pyodbc
from dotenv import dotenv_values

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import warnings

# ignore warnings
warnings.filterwarnings('ignore')

# DATA LOADING

Create a .env file in the root folder of the project and store all the sensitive information (server, database, username, and password) in it.

In [3]:
# Load environment variables from a .env file
env_variables = dotenv_values('../.env')

server= env_variables.get('SERVER')
database= env_variables.get('DATABASE')
username= env_variables.get('USERNAME')
password= env_variables.get('PASSWORD')
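Because `.get` returns `None` for a key that is absent from the .env file, a missing setting would only surface later as a failed connection. The sketch below shows one way to check for this up front. The helper `missing_settings` and the example values are hypothetical, and a plain dictionary stands in for the mapping returned by `dotenv_values`.

```python
# Keys the notebook reads from the .env file.
REQUIRED_KEYS = ("SERVER", "DATABASE", "USERNAME", "PASSWORD")


def missing_settings(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `settings` (hypothetical helper)."""
    return [key for key in required if not settings.get(key)]


# Made-up stand-in for the result of dotenv_values('../.env').
example_settings = {"SERVER": "example-host", "DATABASE": "example_db", "USERNAME": ""}

missing = missing_settings(example_settings)
if missing:
    print(f"Missing settings in .env: {', '.join(missing)}")
```

In the notebook, the same check could be applied to `env_variables` before attempting to connect.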

Create a .gitignore file and add the line '.env' to it. This prevents Git from tracking the file we just created.

Create a connection by passing a connection string built from the environment variables defined above.

In [4]:
# Build the connection string and open a connection to the remote server
# (despite its name, connection_string holds the pyodbc connection object)
connection_string = pyodbc.connect(f'DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}')
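To make the structure of the connection string easier to inspect, it can be assembled separately from the call that opens the connection. The following is a sketch only: `build_connection_string` is a hypothetical helper, the values are made up, and it does not handle values that contain characters such as `;`, which would need extra escaping.

```python
def build_connection_string(server, database, username, password, driver="SQL Server"):
    """Assemble an ODBC connection string from its parts (hypothetical helper)."""
    return (
        f"DRIVER={{{driver}}};"  # doubled braces produce literal { and } around the driver name
        f"SERVER={server};"
        f"DATABASE={database};"
        f"UID={username};"
        f"PWD={password}"
    )


# Made-up values for illustration only.
example = build_connection_string("example-host", "example_db", "user", "secret")
print(example)
```

The resulting string could then be passed to `pyodbc.connect`, as in the cell above. Avoid printing the real string, since it contains the password.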

Set up the queries used to read the data from the remote server

In [5]:
# Query statements to fetch oil, holidays_events and stores data from the remote server
oil_query = 'SELECT * FROM dbo.oil'
holidays_query = 'SELECT * FROM dbo.holidays_events'
stores_query = 'SELECT * FROM dbo.stores'

In [8]:
# Read all data from different sources
oil = pd.read_sql_query(oil_query, connection_string, parse_dates=['date'])
holidays_events = pd.read_sql_query(holidays_query, connection_string, parse_dates=['date'])
stores = pd.read_sql_query(stores_query, connection_string)
transactions = pd.read_csv('./data/transactions.csv', parse_dates=['date'])
train = pd.read_csv('./data/train.csv', parse_dates=['date'])
test = pd.read_csv('./data/test.csv', parse_dates=['date'])
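After loading, it can help to confirm that the `date` column was actually parsed as a datetime and to look at the date range and shape of each table. The sketch below does this on a tiny in-memory CSV with made-up rows. The column names other than `date` are assumptions, and the same checks could be applied to `train`, `test`, `transactions`, and the tables read from the server.

```python
import io

import pandas as pd

# Made-up CSV content standing in for one of the data files.
csv_text = "date,store_nbr,transactions\n2020-01-01,1,100\n2020-01-02,1,200\n"

frame = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Confirm the column was parsed as datetime rather than left as text.
is_datetime = pd.api.types.is_datetime64_any_dtype(frame["date"])

# Summarise the date range and the number of rows and columns.
first_date = frame["date"].min()
last_date = frame["date"].max()
print(is_datetime, first_date.date(), last_date.date(), frame.shape)
```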