# Penguins Dataset Analysis

This notebook walks through basic pandas operations on the `penguins` dataset, loaded through the `seaborn` library: data inspection, selection, filtering, and aggregation.

## Objectives
- Load and inspect the dataset.
- Perform data selection and filtering.
- Summarize data statistics.
- Group and aggregate data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = sns.load_dataset("penguins")
df.head(5)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


### 1. Print first 25 entries

**Purpose:** To inspect the beginning of the dataset to get an initial understanding of the data values and structure.
**Method:** The `head(n)` method returns the first `n` rows of the DataFrame.

In [2]:
df.head(25)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


### 2. Print last 10 entries

**Purpose:** To check the end of the dataset, which helps verify that the data loaded completely and that the trailing rows are not truncated or malformed.
**Method:** The `tail(n)` method returns the last `n` rows of the DataFrame.

In [3]:
df.tail(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
334,Gentoo,Biscoe,46.2,14.1,217.0,4375.0,Female
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,Male
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,Male
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,Male


### 3. Print number of rows and columns of the data frame

**Purpose:** To know the dimensions of the dataset, i.e., how many samples (rows) and features (columns) are present.
**Method:** The `shape` attribute returns a tuple representing the dimensionality of the DataFrame in the format `(rows, columns)`.

In [4]:
f"No. of rows : {df.shape[0]} and column : {df.shape[1]}"

'No. of rows : 344 and column : 7'

### 4. Print names of all columns

**Purpose:** To get a list of all feature names in the dataset for reference.
**Method:** The `columns` attribute returns an Index object containing the column labels.

In [5]:
df.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

### 5. Print the data type of each column

**Purpose:** To understand the data type of each column (e.g., integer, float, object/string), which is crucial for determining appropriate analysis and preprocessing steps.
**Method:** The `dtypes` attribute returns a Series with the data type of each column. `reset_index()` is used here to display the result as a DataFrame for better readability.

In [None]:
# reset_index() converts the Series into a DataFrame
# df.dtypes on its own returns a Series

df.dtypes.reset_index()

Unnamed: 0,index,0
0,species,object
1,island,object
2,bill_length_mm,float64
3,bill_depth_mm,float64
4,flipper_length_mm,float64
5,body_mass_g,float64
6,sex,object
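
The `index` and `0` headers in the output above are the default names produced by `reset_index()`. As a minimal sketch, the snippet below builds a small hypothetical frame (the `toy` name and its values are illustrative, not taken from the penguins data) and gives the dtype table more descriptive headers.

```python
import pandas as pd

# Hypothetical three-row frame; the values are for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3750.0, None, 3250.0],   # a missing value keeps the column float64
    "ring_count": [1, 2, 3],                 # hypothetical integer column
})

# dtypes is a Series indexed by column name; reset_index() turns it into a
# two-column DataFrame, which we then relabel.
dtype_table = toy.dtypes.reset_index()
dtype_table.columns = ["column", "dtype"]
print(dtype_table)
```

A numeric column that contains a missing value is stored as `float64`, which is why the measurement columns in the penguins data are all floats.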


### 6. Print only species column

**Purpose:** To isolate and examine the data within a specific column.
**Method:** Columns can be accessed as attributes (e.g., `df.species`) or using dictionary-style indexing (e.g., `df['species']`). `reset_index()` is used to format the output as a DataFrame.

In [7]:
df.species.reset_index()

Unnamed: 0,index,species
0,0,Adelie
1,1,Adelie
2,2,Adelie
3,3,Adelie
4,4,Adelie
...,...,...
339,339,Gentoo
340,340,Gentoo
341,341,Gentoo
342,342,Gentoo


In [None]:
# to rename columns
df.species.reset_index().rename(columns={"index": "Column_Names"})

Unnamed: 0,Column_Names,species
0,0,Adelie
1,1,Adelie
2,2,Adelie
3,3,Adelie
4,4,Adelie
...,...,...
339,339,Gentoo
340,340,Gentoo
341,341,Gentoo
342,342,Gentoo
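
The two access styles mentioned in the method are interchangeable for simple column names, but only bracket indexing works when a name is not a valid Python identifier. The sketch below uses a hypothetical frame with a made-up `bill length` column to show the difference, and also shows that indexing with a list of names returns a DataFrame without needing `reset_index()`.

```python
import pandas as pd

# Hypothetical frame; "bill length" (with a space) is an invented column name.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill length": [39.1, 46.2],
})

# Attribute access and bracket access return the same Series.
same = toy.species.equals(toy["species"])

# A name containing a space can only be reached with brackets.
bill = toy["bill length"]

# A list of column names returns a one-column DataFrame instead of a Series.
as_frame = toy[["species"]]

print(same, type(bill).__name__, type(as_frame).__name__)
```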


### 7. How many different types of species are there in dataset

**Purpose:** To identify the distinct categories present in a categorical variable and count them.
**Method:** 
- `unique()` returns an array of unique values.
- `nunique()` returns the number of unique values.

In [None]:
# Series.unique() returns an array of the distinct values in the column,
# e.g. array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
list(df.species.unique())

# Series.nunique() returns how many distinct values there are;
# only this last expression is displayed as the cell output
df.species.nunique()

3
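
`unique()` and `nunique()` treat missing values differently, which matters for columns such as `sex` that contain gaps. A minimal sketch on a hypothetical Series (values are illustrative):

```python
import pandas as pd

# Hypothetical column with one missing entry.
sex = pd.Series(["Male", "Female", None, "Male"], name="sex")

values = sex.unique()                        # distinct values, missing marker included
n_present = sex.nunique()                    # missing values are ignored by default
n_with_missing = sex.nunique(dropna=False)   # count the missing marker as a value too

print(len(values), n_present, n_with_missing)
```

Here `unique()` reports three entries while `nunique()` reports two, so the two results should not be assumed to agree on columns with missing data.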

### 8. What is the most popular species

**Purpose:** To determine the mode (most frequent value) of a categorical variable.
**Method:** `value_counts()` returns a Series containing counts of unique values, sorted in descending order. `head(1)` selects the top entry, which is the most frequent.

In [None]:
# value_counts() counts how often each distinct value occurs in the column
# Result before reset_index().head(1):
# species
# Adelie       152
# Gentoo       124
# Chinstrap     68
# Name: count, dtype: int64
df.species.value_counts().reset_index().head(1)

Unnamed: 0,species,count
0,Adelie,152
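
Taking `head(1)` of the sorted counts is one way to find the most frequent value. As an alternative sketch on a hypothetical Series, `idxmax()` returns the label directly and `mode()` returns every value tied for first place.

```python
import pandas as pd

# Hypothetical species labels; the counts are illustrative only.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])

counts = species.value_counts()   # sorted from most to least frequent
top_label = counts.idxmax()       # label with the highest count
top_count = counts.max()          # the highest count itself
modes = species.mode()            # Series holding all most-frequent values

print(top_label, top_count, list(modes))
```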


### 9. Summarize the data frame

**Purpose:** To obtain a statistical summary of the dataset, including measures of central tendency, dispersion, and distribution shape.
**Method:** `describe()` generates descriptive statistics. By default, it handles numeric columns. Using `include='all'` forces it to include summary statistics for all columns, including categorical ones (showing count, unique, top, freq).

In [None]:
# By default, df.describe() summarizes only numeric columns (count, mean, std, min, quartiles, max).
# include="all" extends the summary to every column, including categorical (string/object) ones.

df.describe(include="all")

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
count,344,344,342.0,342.0,342.0,342.0,333
unique,3,3,,,,,2
top,Adelie,Biscoe,,,,,Male
freq,152,168,,,,,168
mean,,,43.92193,17.15117,200.915205,4201.754386,
std,,,5.459584,1.974793,14.061714,801.954536,
min,,,32.1,13.1,172.0,2700.0,
25%,,,39.225,15.6,190.0,3550.0,
50%,,,44.45,17.3,197.0,4050.0,
75%,,,48.5,18.7,213.0,4750.0,
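
The `count` row of the summary reports non-missing values only, which is why it differs between columns. The sketch below reproduces that behaviour on a small hypothetical frame and reads individual cells out of the summary with `loc`.

```python
import pandas as pd

# Hypothetical frame with one missing body mass.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3750.0, None, 3250.0],
})

summary = toy.describe(include="all")

n_species = summary.loc["count", "species"]     # all three labels are present
n_mass = summary.loc["count", "body_mass_g"]    # the missing value is not counted
top = summary.loc["top", "species"]             # most frequent label

print(n_species, n_mass, top)
```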


### 10. Summarize any one of the columns

**Purpose:** To get statistics for a specific feature.
**Method:** Calling `describe()` on a single column (Series) provides statistics relevant to that column's data type.

In [12]:
df.bill_length_mm.describe().reset_index()

Unnamed: 0,index,bill_length_mm
0,count,342.0
1,mean,43.92193
2,std,5.459584
3,min,32.1
4,25%,39.225
5,50%,44.45
6,75%,48.5
7,max,59.6


### 11. Find mean of data frame

**Purpose:** To calculate the average value for each numeric column.
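**Method:** The `mean()` method computes the arithmetic mean of each column, skipping missing values by default. Passing `numeric_only=True` restricts the calculation to numeric columns; since pandas 2.0 the default is `numeric_only=False`, so omitting it raises a `TypeError` on string columns such as `species`.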

In [None]:
df.mean(numeric_only=True)
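
As a rough illustration of what `numeric_only=True` does, the sketch below computes column means on a hypothetical frame and compares the result with explicitly selecting the numeric columns first via `select_dtypes`. The frame and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frame mixing a string column with two numeric ones.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3750.0, None, 3250.0],
    "flipper_length_mm": [181.0, 217.0, 195.0],
})

# String columns are dropped; missing values are skipped in each mean.
means = toy.mean(numeric_only=True)

# Equivalent route: pick numeric columns explicitly, then average.
alt = toy.select_dtypes(include="number").mean()

print(means)
```

The `body_mass_g` mean is taken over the two available values only, so a column's mean and its row count can rest on different numbers of observations.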

### 12. How many different islands are there in this dataset and print their names

**Purpose:** To identify the unique locations (islands) where data was collected.
**Method:** `unique()` returns the unique values in the 'island' column.

In [13]:
df.island.unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

### 13. Find total count of occurrence of each gender

**Purpose:** To see the distribution of samples across genders.
**Method:** `value_counts()` on the 'sex' column counts how many times each value appears. Missing values are excluded by default, which is why the counts sum to 333 rather than 344.

In [14]:
df.sex.value_counts()

sex
Male      168
Female    165
Name: count, dtype: int64
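
To make the missing entries visible in the tally, `value_counts` accepts `dropna=False`. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical column with two missing entries.
sex = pd.Series(["Male", "Female", None, "Male", None])

counts = sex.value_counts()                    # missing values left out
with_missing = sex.value_counts(dropna=False)  # missing values get their own row
n_missing = sex.isna().sum()                   # direct count of missing entries

print(with_missing)
```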

### 14. Print rows 3-9

**Purpose:** To select a specific range of rows by their integer position.
**Method:** `iloc` is used for integer-location based indexing. `3:10` selects rows from index 3 up to (but not including) 10.

In [None]:
df.iloc[3:10, 0:]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


### 15. Select every row after the fourth row and all columns

**Purpose:** To subset the data starting from a specific index to the end.
**Method:** `iloc[4:, :]` selects rows starting from index 4 to the end, and all columns.

In [None]:
df.iloc[4:, :]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


### 16. Select every row up to 4th row and all columns

**Purpose:** To select the initial set of rows up to a specific index.
**Method:** `iloc[:4, :]` selects rows from the beginning up to (but not including) index 4.

In [None]:
df.iloc[:4, :]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,


### 17. Select every second row starting from the 5th row

**Purpose:** To select rows with a specific step (stride).
**Method:** `iloc[4::2, :]` selects rows starting from index 4 to the end, with a step of 2.

In [None]:
df.iloc[4::2, :]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
12,Adelie,Torgersen,41.1,17.6,182.0,3200.0,Female
...,...,...,...,...,...,...,...
334,Gentoo,Biscoe,46.2,14.1,217.0,4375.0,Female
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
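
The slices used in sections 14 to 17 all follow Python's `start:stop:step` rule, with `stop` excluded. The sketch below replays them on a hypothetical ten-row frame so the selected positions are easy to check by eye.

```python
import pandas as pd

# Hypothetical frame with ten rows labelled 0..9.
toy = pd.DataFrame({"value": range(10)})

rows_3_to_9 = toy.iloc[3:10]    # positions 3, 4, ..., 9 (10 is excluded)
from_fifth = toy.iloc[4:]       # position 4 (the fifth row) to the end
first_four = toy.iloc[:4]       # positions 0, 1, 2, 3
every_second = toy.iloc[4::2]   # positions 4, 6, 8

print(list(every_second.index))
```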


### 18. Select rows where bill_length_mm is greater than 39 and sex is 'Male'

**Purpose:** To filter the dataset based on multiple conditions.
**Method:** Boolean indexing is used. Conditions are combined using the `&` (element-wise AND) operator. Each condition must be wrapped in parentheses because `&` binds more tightly than comparison operators such as `>` and `==`.

In [None]:
df[(df.bill_length_mm > 39) & (df.sex == "Male")]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
17,Adelie,Torgersen,42.5,20.7,197.0,4500.0,Male
19,Adelie,Torgersen,46.0,21.5,194.0,4200.0,Male
...,...,...,...,...,...,...,...
333,Gentoo,Biscoe,51.5,16.3,230.0,5500.0,Male
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,Male
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,Male
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
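
Rows with a missing value in either column never satisfy the combined condition, because comparisons against a missing value evaluate to `False`. A small sketch on a hypothetical frame (values are illustrative) makes the mask explicit:

```python
import pandas as pd

# Hypothetical frame: row 2 has no recorded sex, row 3 has no bill length.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 38.0, 42.0, None],
    "sex": ["Male", "Male", None, "Male"],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (toy.bill_length_mm > 39) & (toy.sex == "Male")
selected = toy[mask]

print(list(mask))
```

Only the first row passes: the second fails the length test, and the last two are dropped because one of the compared values is missing.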


### 19. Find records for only two species where bill_length_mm is greater than 20

**Purpose:** To filter data based on complex conditions involving multiple column values.
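**Method:** Boolean indexing with the `&` (AND) and `|` (OR) operators. Because `&` binds more tightly than `|`, the expression in the cell is evaluated as "(depth > 20 AND Adelie) OR Gentoo": it returns Adelie rows with `bill_depth_mm` above 20 plus every Gentoo row, which accounts for the 138 rows. To apply the threshold to both species, wrap the two species tests in their own parentheses. Note also that the cell filters on `bill_depth_mm`, whereas the heading names `bill_length_mm`.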

In [None]:
df[(df.bill_depth_mm > 20) & (df.species == "Adelie") | (df.species == "Gentoo")].shape

(138, 7)
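
Operator precedence changes which rows this kind of filter returns. The following sketch, on a hypothetical five-row frame with made-up depths, contrasts the unparenthesized form with two ways of applying the threshold to both species: explicit grouping and `isin`.

```python
import pandas as pd

# Hypothetical frame; the depth values are invented for illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "bill_depth_mm": [21.0, 18.0, 15.0, 20.5, 21.5],
})

deep = toy.bill_depth_mm > 20
adelie = toy.species == "Adelie"
gentoo = toy.species == "Gentoo"

# Parsed as (deep & adelie) | gentoo: every Gentoo row passes.
as_written = toy[deep & adelie | gentoo]

# Threshold applied to both species.
grouped = toy[deep & (adelie | gentoo)]

# Same result, often easier to read when there are several allowed values.
with_isin = toy[deep & toy.species.isin(["Adelie", "Gentoo"])]

print(list(as_written.index), list(grouped.index))
```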

### 20. Select the rows with index 3 & 5

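**Purpose:** To select specific rows by position or by index label.
**Method:** `iloc[3:6:2]` slices by position with a step of 2, which yields positions 3 and 5. `loc` indexes by label, and a `loc` slice includes both endpoints, so `loc[3:5, ...]` returns rows 3, 4, and 5; passing a list such as `loc[[3, 5]]` selects exactly those two labels.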

In [21]:
df.iloc[3:6:2]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,,
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male


In [None]:
df.loc[3:5, "species":"bill_depth_mm"]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm
3,Adelie,Torgersen,,
4,Adelie,Torgersen,36.7,19.3
5,Adelie,Torgersen,39.3,20.6
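
On a freshly loaded frame, labels and positions coincide, so `loc` and `iloc` appear to behave alike. The sketch below uses a hypothetical eight-row frame to show where they diverge: slice endpoints, and any frame whose index no longer starts at 0.

```python
import pandas as pd

# Hypothetical frame with labels 0..7.
toy = pd.DataFrame({"value": list("abcdefgh")})

by_label = toy.loc[[3, 5]]       # exactly labels 3 and 5
label_slice = toy.loc[3:5]       # labels 3, 4, 5 (end label included)
position_slice = toy.iloc[3:5]   # positions 3, 4 (end position excluded)

# After dropping the first two rows, labels and positions no longer match.
shifted = toy.iloc[2:]
shifted_by_label = shifted.loc[[3, 5]]      # still labels 3 and 5
shifted_by_position = shifted.iloc[[3, 5]]  # 4th and 6th remaining rows

print(list(shifted_by_label.index), list(shifted_by_position.index))
```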


### 21. For each island, calculate minimum and maximum flipper_length_mm

**Purpose:** To perform grouped aggregations.
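**Method:** `groupby()` followed by `agg()` applies several aggregation functions to each group. The cell computes `sum` and `count` alongside the requested `min` and `max`; passing the function names as a list (rather than a set) keeps the output columns in a fixed order.

In [None]:
df.groupby("island").flipper_length_mm.agg(["sum", "min", "max", "count"]).reset_index()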

Unnamed: 0,island,sum,min,max,count
0,Biscoe,35021.0,172.0,231.0,167
1,Dream,23941.0,178.0,212.0,124
2,Torgersen,9751.0,176.0,210.0,51
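
When only the minimum and maximum are needed, pandas' named aggregation lets each output column be given an explicit name. A minimal sketch on a hypothetical frame; the output column names (`min_flipper`, `max_flipper`, `n_measured`) are invented for the example.

```python
import pandas as pd

# Hypothetical measurements; one Dream value is missing.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "flipper_length_mm": [172.0, 231.0, 178.0, None, 212.0],
})

# Each keyword is an output column: (source column, aggregation function).
summary = (
    toy.groupby("island")
       .agg(
           min_flipper=("flipper_length_mm", "min"),
           max_flipper=("flipper_length_mm", "max"),
           n_measured=("flipper_length_mm", "count"),  # non-missing values only
       )
       .reset_index()
)

print(summary)
```

As with the `count` column in the output above, `n_measured` excludes missing values, so it can be smaller than the number of rows per island.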
