---

<center><h1> Basics of Scikit Learn </h1></center>


---

- Scikit-learn provides simple and efficient tools for pre-processing and predictive modeling.


---

***Steps to build a model in scikit-learn.***

---

1. Import the model.
2. Prepare the data set.
3. Separate the independent and target variables.
4. Create an object of the model.
5. Fit the model with the data.
6. Use the model to predict the target.
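
As a rough illustration of the six steps, here is a minimal sketch on a tiny made-up table. The `area` and `price` columns and the choice of `LinearRegression` are assumptions made for this example only; they are not part of the Big Mart data used later in this notebook.

```python
# 1. Import the model.
import pandas as pd
from sklearn.linear_model import LinearRegression

# 2. Prepare the data set (a tiny made-up table, for illustration only).
toy = pd.DataFrame({
    "area":  [50, 60, 80, 100, 120],
    "price": [100, 120, 160, 200, 240],
})

# 3. Separate the independent and target variables.
X = toy[["area"]]   # 2-D: one row per sample, one column per feature
y = toy["price"]    # 1-D target

# 4. Create an object of the model.
model = LinearRegression()

# 5. Fit the model with the data.
model.fit(X, y)

# 6. Use the model to predict the target for a new sample.
new_sample = pd.DataFrame({"area": [90]})
prediction = model.predict(new_sample)
print(prediction)
```

Because the toy prices are exactly twice the area, the prediction for an area of 90 should come out at about 180.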


In [1]:
# import the scikit-learn library
import sklearn

***If you get an error while running the above cell, install the library using one of the following commands.***

If you are using anaconda with python3: ***`!pip install scikit-learn`***

If you are using jupyter with python3: ***`!pip3 install scikit-learn`***

---

In [2]:
#check the version 
sklearn.__version__

'0.24.1'

- ***We have seen in the pandas notebook that we have some missing values in our data.***
- ***We will impute those missing values using the scikit-learn imputer.***

In [4]:
# read the data set and check for the null values
import pandas as pd
data = pd.read_csv("C:/Users/sivak/OneDrive/Desktop/Books_Data/analytics vidhya/6.Python Libraries/5.sklearn/Dataset/big_mart_sales.csv")

In [5]:
data

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,,Tier 2,Supermarket Type1,549.2850
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


In [6]:
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [7]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

---

- For imputing the missing values, we will use `SimpleImputer`.
- First we will create an object of the imputer and define the strategy.
- We will impute Item_Weight with the `mean` value and Outlet_Size with the `most_frequent` value.
- Fit the objects with the data.
- Transform the data.

---
---

In [9]:
# Create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy = 'mean')
impute_size = SimpleImputer(strategy = 'most_frequent')

In [11]:
# fit the Item_Weight imputer with the data and transform
impute_weight.fit(data[['Item_Weight']])
data.Item_Weight = impute_weight.transform(data[['Item_Weight']])

In [12]:
# fit the Outlet_Size imputer with the data and transform
impute_size.fit(data[['Outlet_Size']])
# transform the same column the imputer was fitted on (Outlet_Size, not Item_Weight)
data.Outlet_Size = impute_size.transform(data[['Outlet_Size']])

In [13]:
# check the Null Values
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64
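
To see what the two imputers actually learn, the same idea can be tried on a small hand-made frame. This is only a sketch: the four rows below are invented, and `fit_transform` with `ravel()` is used here as a compact alternative to the separate `fit` and `transform` calls above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the two columns with missing values.
toy = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0, 30.0],
    "Outlet_Size": pd.Series(["Medium", "Small", np.nan, "Medium"], dtype=object),
})

impute_weight = SimpleImputer(strategy="mean")
impute_size = SimpleImputer(strategy="most_frequent")

# fit_transform learns the fill value and applies it in one call.
# It returns a 2-D array, so ravel() flattens it back to a single column.
toy["Item_Weight"] = impute_weight.fit_transform(toy[["Item_Weight"]]).ravel()
toy["Outlet_Size"] = impute_size.fit_transform(toy[["Outlet_Size"]]).ravel()

# statistics_ holds the value each imputer learned during fit.
print(impute_weight.statistics_)
print(impute_size.statistics_)
print(toy)
```

In this toy frame the missing weight is filled with 20.0 (the mean of 10, 20 and 30) and the missing size with `Medium` (the most frequent label).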

---

- ***Now, after the preprocessing step, we separate the independent and target variables and pass the data to the model object to train the model.***

- ***If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, we have a `Classification Problem`.***

- ***If we have to predict a continuous attribute, such as sales, based on some features, we have a `Regression Problem`.***

---

***`SCIKIT-LEARN` has tools that help you build regression and classification models, among many others.***

In [14]:
# Some of the basic models scikit-learn provides
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
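
The difference between the two problem types can be sketched with these two models on made-up data. The single feature, the sales figures and the `cat`/`dog` labels are all invented for illustration; `random_state=0` is set only to keep the tree reproducible.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up feature: a single numeric column.
X = [[1], [2], [3], [4], [5], [6]]

# Regression: the target is a continuous number.
y_sales = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
regressor = LinearRegression().fit(X, y_sales)
sales_pred = regressor.predict([[7]])
print(sales_pred)

# Classification: the target is a category label.
y_label = ["cat", "cat", "cat", "dog", "dog", "dog"]
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_label)
label_pred = classifier.predict([[1], [6]])
print(label_pred)
```

Both models share the same `fit` and `predict` interface; only the kind of target they are trained on differs.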

---

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So, scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

---

In [15]:
from sklearn.pipeline import Pipeline
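
A minimal sketch of such a pipeline, assuming a mean imputer followed by a linear regression, might look like this. The data is made up, and the step names `"impute"` and `"model"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up training data with one missing feature value.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Each step is a (name, object) pair; the step names are arbitrary labels.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

# fit() learns the imputation mean, then trains the model on the imputed data.
pipe.fit(X_train, y_train)

# predict() re-applies the learned imputation to new rows before predicting.
X_new = np.array([[3.0], [np.nan]])
predictions = pipe.predict(X_new)
print(predictions)
```

The new rows never need to be imputed by hand: the pipeline stores the mean it learned during `fit` and reuses it every time `predict` is called.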

---

***Learn more about scikit-learn here: https://scikit-learn.org/stable/index.html***

---