# Simple Linear Regression

## Dataset
The penguins dataset is available through the seaborn library and serves as an alternative to the iris dataset. It contains data for 344 penguins of 3 different species, collected from 3 islands in the Palmer Archipelago, Antarctica.

In [1]:
# imports
import pandas as pd
import seaborn as sns

In [2]:
# Load dataset
penguins = sns.load_dataset("penguins")
penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [4]:
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [6]:
penguins.describe()


,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0
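
Note that `count` is 342 rather than 344: `describe()` counts only the non-missing values in each column. A minimal sketch of that behaviour, assuming a small stand-in Series `bill_length` (a hypothetical name) built from the `bill_length_mm` values of the five rows shown by `penguins.head()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: bill_length_mm from the five rows shown by penguins.head()
bill_length = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7], name="bill_length_mm")

summary = bill_length.describe()

print(len(bill_length))   # total number of rows, including the missing one
print(summary["count"])   # number of non-missing rows only
```

The same gap between the row count and `count` is a quick hint that a column contains missing values.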


## Missing Values 

In [26]:
# function to summarize missing values by column
def missing_value_summary(df):
    # iterate over the columns of the DataFrame passed in, not the global `penguins`
    for column in df.columns:
        print("{} - {} counts of missing values.".format(column, df[column].isna().sum()))

missing_value_summary(penguins)

species - 0 counts of missing values.
island - 0 counts of missing values.
bill_length_mm - 2 counts of missing values.
bill_depth_mm - 2 counts of missing values.
flipper_length_mm - 2 counts of missing values.
body_mass_g - 2 counts of missing values.
sex - 11 counts of missing values.
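
The per-column counts can also be computed without an explicit loop, since `isna().sum()` works on a whole DataFrame. A minimal sketch, assuming a small stand-in frame `sample` (a hypothetical name) built from the five rows shown by `penguins.head()` so that it runs without downloading the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `penguins`: the five rows printed by penguins.head()
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
        "sex": ["Male", "Female", "Female", None, "Female"],
    }
)

# isna() marks each missing cell; sum() then counts them per column
missing_counts = sample.isna().sum()
print(missing_counts)
```

The result is a Series indexed by column name, so it can be sorted or filtered further, for example `missing_counts[missing_counts > 0]`.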


## Data Cleaning

For the analysis, we keep only the Adelie and Gentoo species, so the Chinstrap rows are filtered out.

In [29]:
penguins_extract = penguins[penguins['species'] != "Chinstrap"]

# Dropping all the missing values from the extracted data
df = penguins_extract.dropna()

# resetting the index for rows after dropping the rows with missing values
df.reset_index(inplace=True, drop=True)

# displaying first 5 rows of the cleaned data
df.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
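
The filter, drop, and re-index steps can also be written as one method chain. This is a sketch under stated assumptions, not the notebook's own code: the frame `toy` and its values are made up for illustration and are not real measurements, and `isin` is used here as an alternative to the `!=` comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not real measurements
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo"],
        "body_mass_g": [3750.0, 3500.0, 5000.0, np.nan, 4800.0],
    }
)

cleaned = (
    toy[toy["species"].isin(["Adelie", "Gentoo"])]  # keep the two species of interest
    .dropna()                                       # drop rows with any missing value
    .reset_index(drop=True)                         # renumber the remaining rows from 0
)
print(cleaned)
```

Listing the species to keep, rather than the one to exclude, makes the intent explicit and stays correct if further species are ever added to the data.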
