# Introdunction to Supervised Machine Learning
Supervised learning algorithms are a category of machine learning models that rely on labeled data for training. In supervised learning, the algorithm learns to map inputs to outputs by analyzing a dataset that includes both the input features and the correct corresponding labels (the output). After the model is trained, it can be used to predict the label for new, unseen data.

In Supervised Machine Learning:
- Machines are trained with labelled data as input.
- The Machine learning model identify:
    - Patterns and methods
    - Learn from them
    - Predicts output
    
- Supervised Machine learning algorithms includes:
    - Linear and Logistic regression
    - Multi-class classification
    - Decision trees
    - Support vector machines
    
- Supervised machine learning  is used to detect:
    - Forest fires
    - Oil & gas tragedies
    - Shipping fires
    - Building fires
    
If we use the Supervised machine learning model to detection of fire incident in that case we have to include some parameters like:
- Area spread of fire
- smoke levels
- Temperature
- Rate of increase of spread
- Outcome labelled as fire incident or not

## Algorithms of Supervised Machine learning

Supervised machine have a two types of algorithms:
- Classification
- Regression


### Classification
Classification algorithm segregates data into two or more categories with one or more inputs, a classification model predicts the value of one or more outcomes.

we are use the classification model to segregate the emails into Spam or Ham, so first model examines the data to find patterns and methods to determine whether email is a Ham or a Spam

### Regression
Regression algorithms establishes the relationship between input and output variables which is suitable for situations where the output variable is a real or continuous value.

A regression algorithm is used to forecast or predict the value of the stock market.



# Application of Supervised Machine Learning
- Quality inspection in manufacturing: Inspects quality and classify manufactured products in the good condition or damaged condition.
- Forecasting for the maritme industry
- Fraud protection measurs
- Waste management Systems
- Healthcare
- Marketing and Sales
- Retail and E-Commerce
- Product Recommendation
- Price Optimization

# Preparing and Shaping Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
df = sns.load_dataset('titanic') # load the titanic data set

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [6]:
df['Travelalone'] = np.where((df['sibsp'] + df['parch'])> 0, 0, 1).astype('uint8')

In [7]:
df1 = df.drop(['alive', 'alone', 'who', 'sibsp', 'parch', 'deck', 'class'], axis = 1)

In [8]:
df1.isna().sum()

survived         0
pclass           0
sex              0
age            177
fare             0
embarked         2
adult_male       0
embark_town      2
Travelalone      0
dtype: int64

In [9]:
df['age'].fillna(df1['age'].median(skipna = True), inplace = True)

In [10]:
df1.head()

Unnamed: 0,survived,pclass,sex,age,fare,embarked,adult_male,embark_town,Travelalone
0,0,3,male,22.0,7.25,S,True,Southampton,0
1,1,1,female,38.0,71.2833,C,False,Cherbourg,0
2,1,3,female,26.0,7.925,S,False,Southampton,1
3,1,1,female,35.0,53.1,S,False,Southampton,0
4,0,3,male,35.0,8.05,S,True,Southampton,1


In [11]:
df_titanic = pd.get_dummies(df1, columns = ['pclass', 'embarked', 'sex'], drop_first = True)

In [12]:
df_titanic.head()

Unnamed: 0,survived,age,fare,adult_male,embark_town,Travelalone,pclass_2,pclass_3,embarked_Q,embarked_S,sex_male
0,0,22.0,7.25,True,Southampton,0,0,1,0,1,1
1,1,38.0,71.2833,False,Cherbourg,0,0,0,0,0,0
2,1,26.0,7.925,False,Southampton,1,0,1,0,1,0
3,1,35.0,53.1,False,Southampton,0,0,0,0,1,0
4,0,35.0,8.05,True,Southampton,1,0,1,0,1,1


In [13]:
df_titanic = df_titanic.drop(['embark_town', 'adult_male'], axis = 1)

In [14]:
df_titanic.head()

Unnamed: 0,survived,age,fare,Travelalone,pclass_2,pclass_3,embarked_Q,embarked_S,sex_male
0,0,22.0,7.25,0,0,1,0,1,1
1,1,38.0,71.2833,0,0,0,0,0,0
2,1,26.0,7.925,1,0,1,0,1,0
3,1,35.0,53.1,0,0,0,0,1,0
4,0,35.0,8.05,1,0,1,0,1,1


In [15]:
x = df_titanic.drop(['survived'], axis = 1)
y = df_titanic['survived']

In [16]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

#### MinMaxScaler:

It transforms features by scaling each feature to a given range, typically between 0 and 1. This is useful when you want all features to be in the same range without changing their distribution.

#### StandardScaler:

This scaler standardizes features by removing the mean and scaling to unit variance (mean = 0 and variance = 1). It is commonly used when the data follows a Gaussian distribution or for algorithms that assume normally distributed data.

In [17]:
trans_MM = MinMaxScaler()
trans_SS = StandardScaler()

In [18]:
df_MM = trans_MM.fit_transform(x)
pd.DataFrame(df_MM)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.271174,0.014151,0.0,0.0,1.0,0.0,1.0,1.0
1,0.472229,0.139136,0.0,0.0,0.0,0.0,0.0,0.0
2,0.321438,0.015469,1.0,0.0,1.0,0.0,1.0,0.0
3,0.434531,0.103644,0.0,0.0,0.0,0.0,1.0,0.0
4,0.434531,0.015713,1.0,0.0,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...
886,0.334004,0.025374,1.0,1.0,0.0,0.0,1.0,1.0
887,0.233476,0.058556,1.0,0.0,0.0,0.0,1.0,0.0
888,,0.045771,0.0,0.0,1.0,0.0,1.0,0.0
889,0.321438,0.058556,1.0,0.0,0.0,0.0,0.0,1.0


In [19]:
df_SS = trans_SS.fit_transform(x)
pd.DataFrame(df_SS)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-0.530377,-0.502445,-1.231645,-0.510152,0.902587,-0.307562,0.619306,0.737695
1,0.571831,0.786845,-1.231645,-0.510152,-1.107926,-0.307562,-1.614710,-1.355574
2,-0.254825,-0.488854,0.811922,-0.510152,0.902587,-0.307562,0.619306,-1.355574
3,0.365167,0.420730,-1.231645,-0.510152,-1.107926,-0.307562,0.619306,-1.355574
4,0.365167,-0.486337,0.811922,-0.510152,0.902587,-0.307562,0.619306,0.737695
...,...,...,...,...,...,...,...,...
886,-0.185937,-0.386671,0.811922,1.960202,-1.107926,-0.307562,0.619306,0.737695
887,-0.737041,-0.044381,0.811922,-0.510152,-1.107926,-0.307562,0.619306,-1.355574
888,,-0.176263,-1.231645,-0.510152,0.902587,-0.307562,0.619306,-1.355574
889,-0.254825,-0.044381,0.811922,-0.510152,-1.107926,-0.307562,-1.614710,0.737695


# Overfitting & Underfitting

### Bias
- Bias is an error introduced in the model.
- High bias is a big difference between the actual and predicted values and high basa is not good for our model.
- Low bais indicates that the difference between the actual and predicted values is low.

### Variance
- Variance indicates how scattered data is.
    - High variance indicates more scattered data.
    - Low vairance indicates less scattered data.

## Overfitting
- Overfit indicates a low bias and high variance in the data.
- Overfitting happens when a model focuses on too many details in the training dataset.
    - It has a negative impact on the performance of the model on a new dataset.
    
## Underfitting
- Underfitting is high bias and high variance in the data.
- Underfitting is easily detectable because it exhibit poor performance on the training dataset.
- A model is underfit if it is trained with limited features

**If a model performs well on training data and testing data then we can say our model is good.**

**If a model performs well with training data but not with testing data, it is called as Overfit**

**If a model does not perform well on both training data and testig data, it is called as Underfit**

# Detecting and Preventing Overfitting and Underfitting

In [20]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot

In [21]:
X, y = make_classification(n_samples = 9000, n_features=18 ,n_informative= 4, n_redundant= 12, random_state= 4)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [23]:
train_scores, test_scores = list(), list()

In [24]:
values = [i for i in range(1, 21)]

In [25]:
for i in values:
    model = DecisionTreeClassifier(max_depth= i)
    model.fit(X_train, y_train)
    train_yhat = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_yhat)
    test_yhat = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_yhat)
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    print('>>%d, train: %.3f, test: %.3f' %(i, train_acc, test_acc))

>>1, train: 0.840, test: 0.827
>>2, train: 0.856, test: 0.841
>>3, train: 0.888, test: 0.881
>>4, train: 0.908, test: 0.901
>>5, train: 0.921, test: 0.906
>>6, train: 0.931, test: 0.910
>>7, train: 0.949, test: 0.921
>>8, train: 0.959, test: 0.930
>>9, train: 0.966, test: 0.932
>>10, train: 0.976, test: 0.935
>>11, train: 0.983, test: 0.940
>>12, train: 0.987, test: 0.944
>>13, train: 0.991, test: 0.940
>>14, train: 0.994, test: 0.943
>>15, train: 0.996, test: 0.941
>>16, train: 0.998, test: 0.939
>>17, train: 0.999, test: 0.939
>>18, train: 0.999, test: 0.943
>>19, train: 0.999, test: 0.941
>>20, train: 1.000, test: 0.940


# hello 
- aldjfjasjfdjfj
    - asjfdjj


## asdjfj
### ladfj
#### akjsdfj