You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data**: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, the researchers were not able to record the species of each penguin, but they know that **at least three** species are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**. Your task is to apply your data science skills to help them identify groups in the dataset!
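Since no species labels exist, finding groups here is an unsupervised clustering problem, and the imports in the next cell suggest k-means as the method. As a rough illustration of what k-means does, the sketch below is a tiny pure-Python version of the assign-then-update loop (Lloyd's algorithm). It is not scikit-learn's `KMeans` implementation: the function name `kmeans`, the toy 2-D points, and the hand-picked starting centroids are all assumptions made for illustration only.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Illustrative Lloyd's algorithm on tuples of floats (hypothetical helper)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:
                # Keep an empty cluster's centroid where it is.
                new_centroids.append(centroids[k])
                continue
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up toy points forming three obvious groups (not penguin data).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 1.2)]
initial = [points[0], points[2], points[4]]  # hand-picked starting centroids

labels, centers = kmeans(points, initial)
print(labels)
```

With real data the number of clusters is not known in advance, so in practice it is usually chosen by comparing several candidate values rather than fixed by hand as it is here.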

In [50]:
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler




In [51]:
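# Load and examine the dataset
penguins_df = pd.read_csv("penguins.csv")
print(penguins_df.head())        # first five rows
penguins_df.info()               # column dtypes and non-null counts
penguins_df.describe()           # summary statistics (not displayed: not the cell's last expression)
print(penguins_df.isna().sum())  # missing values per column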



   culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0              39.1             18.7              181.0       3750.0    MALE
1              39.5             17.4              186.0       3800.0  FEMALE
2              40.3             18.0              195.0       3250.0  FEMALE
3              36.7             19.3              193.0       3450.0  FEMALE
4              39.3             20.6              190.0       3650.0    MALE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   culmen_length_mm   332 non-null    float64
 1   culmen_depth_mm    332 non-null    float64
 2   flipper_length_mm  332 non-null    float64
 3   body_mass_g        332 non-null    float64
 4   sex                332 non-null    object 
dtypes: float64(4), object(1)
memory usage: 13.1+ KB
culmen_length_mm     0
culmen_depth_mm      0
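The preview above shows why `StandardScaler` was imported: `body_mass_g` is in the thousands while the other measurements are in the tens or hundreds, so an unscaled Euclidean distance is driven almost entirely by body mass. The sketch below illustrates this using only the five preview rows and the standard library. The helper names `standardize` and `mass_share` are hypothetical, and the z-scores are computed by hand with the population standard deviation, which is what `StandardScaler` uses by default.

```python
import statistics

# The five rows printed by penguins_df.head():
# (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g)
rows = [
    (39.1, 18.7, 181.0, 3750.0),
    (39.5, 17.4, 186.0, 3800.0),
    (40.3, 18.0, 195.0, 3250.0),
    (36.7, 19.3, 193.0, 3450.0),
    (39.3, 20.6, 190.0, 3650.0),
]


def standardize(rows):
    """Z-score every column: subtract its mean, divide by its population std."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def mass_share(a, b):
    """Fraction of the squared Euclidean distance that comes from body mass."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[-1] / sum(squared)


raw_share = mass_share(rows[0], rows[2])
scaled = standardize(rows)
scaled_share = mass_share(scaled[0], scaled[2])

print(f"share of distance from body mass, raw:          {raw_share:.3f}")
print(f"share of distance from body mass, standardized: {scaled_share:.3f}")
```

Five rows are far too few to estimate real means and standard deviations, so the numbers only illustrate the effect. On the full dataset the scaler would be fitted to all 332 rows.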

In [52]:
penguins_df.sex.unique()
# Create dummy variables (convert categorical features into numeric values)
# sex_map = {'MALE': 0, 'FEMALE': 1}
# penguins_df.sex = penguins_df.sex.map(sex_map)
# penguins_df.sex.unique()
sex_dummies = pd.get_dummies(penguins_df.sex, dtype=int)
penguins_df = pd.concat([penguins_df, sex_dummies], axis=1)
penguins_df

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,FEMALE,MALE
0,39.1,18.7,181.0,3750.0,MALE,0,1
1,39.5,17.4,186.0,3800.0,FEMALE,1,0
2,40.3,18.0,195.0,3250.0,FEMALE,1,0
3,36.7,19.3,193.0,3450.0,FEMALE,1,0
4,39.3,20.6,190.0,3650.0,MALE,0,1
...,...,...,...,...,...,...,...
327,47.2,13.7,214.0,4925.0,FEMALE,1,0
328,46.8,14.3,215.0,4850.0,FEMALE,1,0
329,50.4,15.7,222.0,5750.0,MALE,0,1
330,45.2,14.8,212.0,5200.0,FEMALE,1,0
