
**Machine Learning Lab - CSE 432**

# 2 Data Exploration with pandas

**Pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In this lab, we will learn how to explore and manipulate data with the pandas library.

**2.1 Installing and Importing Pandas**

The following command installs pandas into your Python environment. If pandas is already installed (as it is on Google Colab), there is no need to run it.

In [None]:
#!pip install pandas

In [None]:
import pandas as pd

**2.2 Importing a data set**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/COURSES/ML_CSE432_Lab/02 Data Exploration/titanic.csv')
print(df)

In [None]:
df

In [None]:
pd.set_option('display.max_rows', None)
df

In [None]:
pd.set_option('display.max_rows', 10)
df
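A setting changed with `pd.set_option` stays in effect for every later cell. As a possible alternative, `pd.option_context` applies a display option only inside a `with` block. The sketch below assumes a small made-up DataFrame named `demo` rather than the Titanic file.

```python
import pandas as pd

# Hypothetical stand-in data; the lab itself uses titanic.csv
demo = pd.DataFrame({"id": range(30), "value": [i * 2 for i in range(30)]})

before = pd.get_option("display.max_rows")

# The option is changed only inside this block
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
    print(demo)  # printed in truncated form

# Outside the block the previous setting is back
after = pd.get_option("display.max_rows")
```

This keeps a one-off "show everything" or "show less" view from leaking into the rest of the notebook.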

**2.3 Viewing and Understanding Dataset**

In [None]:
df.head(10)

In [None]:
df.tail(5)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isnull().sum()

In [None]:
df.nunique()
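To make the inspection calls above easier to check by hand, here is a minimal sketch on a four-row DataFrame. The frame `demo` and its values are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, np.nan, 41.0, 29.0],
})

n_rows, n_cols = demo.shape      # (rows, columns)
missing = demo.isnull().sum()    # missing values per column
distinct = demo.nunique()        # distinct non-missing values per column

print(n_rows, n_cols)
print(missing)
print(distinct)
```

Note that `nunique()` ignores missing values by default, so the `age` column counts two distinct values here, not three.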

**2.4 Slicing and Extracting Data**

In [None]:
# Isolating a single column
df['name'].head()

In [None]:
# Isolating multiple columns
df[['name', 'sex', 'age']].head()

In [None]:
# Isolating a single row
df.loc[5]

In [None]:
# Isolating multiple rows with range
df.loc[5:20]

In [None]:
# Isolating multiple rows with list
df.loc[[1, 5, 10, 6]]

In [None]:
dft = df.loc[5:20].copy()
dft.head()

In [None]:
# dft keeps the original labels 5..20, so the label 0 does not exist
try:
    print(dft.loc[0])
except KeyError:
    print("There is an error")

In [None]:
# iloc[] selects by position, so position 0 is the row labelled 5
try:
    print(dft.iloc[0])
except IndexError:
    print("There is an error")
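The two cells above behave differently because `loc` looks up index labels while `iloc` looks up integer positions. The following sketch reproduces that on a small invented DataFrame (`demo` and `sub` are illustrative names, not part of the lab data).

```python
import pandas as pd

# Hypothetical data with the default index 0..7
demo = pd.DataFrame({"name": list("abcdefgh")})

# A label slice keeps the original labels 5, 6, 7
sub = demo.loc[5:7].copy()

first_by_position = sub.iloc[0]["name"]  # first row, whatever its label
first_by_label = sub.loc[5]["name"]      # row whose label is 5

try:
    sub.loc[0]                # there is no label 0 in sub
    label_zero_found = True
except KeyError:
    label_zero_found = False

# reset_index(drop=True) renumbers the rows from 0 again
reset = sub.reset_index(drop=True)
print(first_by_position, first_by_label, label_zero_found)
```

If the old labels are not needed, resetting the index after slicing avoids this kind of surprise.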

In [None]:
# Isolating both rows and columns with positional ranges
# iloc[] slices by integer position (end excluded); loc[] slices by label instead
df.iloc[0:20, 2:5].head()

In [None]:
# Isolating both rows and columns with a list of column names
# loc[] accepts column names; iloc[] accepts only integer positions
df.loc[0:20, ['name', 'sex', 'age']].head()
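As a side note, `loc` also accepts a column range when it is written with column labels, and its slices include the end point, whereas `iloc` slices exclude it. A minimal sketch, assuming an invented frame `demo` whose column order is `pclass, name, sex, age`:

```python
import pandas as pd

# Hypothetical data; column order matters for the slices below
demo = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "sex": ["female", "male", "male", "female", "female"],
    "age": [29, 35, 41, 8, 52],
})

# Label slices include both ends: rows 0, 1, 2 and columns name..age
by_label = demo.loc[0:2, "name":"age"]

# Position slices exclude the end: rows 0, 1 and columns 1, 2, 3
by_position = demo.iloc[0:2, 1:4]

print(by_label)
print(by_position)
```

Both selections return the same three columns, but the label slice returns one more row than the position slice.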

In [None]:
# This is how we change a specific attribute of a single row
dft.loc[5, 'name'] = 'Ratri'

# Change the other attributes of the same row
dft.loc[5, 'sex'] = 'female'
dft.loc[5, 'age'] = 26
dft.loc[5, 'survived'] = 1
dft.head()

In [None]:
# We can sort the data by an attribute
dft.sort_values(by='age', ascending=True).head()

**2.5 Slicing with Conditions**

Slicing with conditions selects a subset of a pandas DataFrame or Series based on some criteria. The loc and iloc indexers select by index label and by integer position respectively; loc also accepts a boolean mask, which lets us filter rows by their values. Alternatively, the query method takes the conditions as a string expression (we will not use it here).
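To illustrate the two approaches side by side, the sketch below builds a boolean mask and then writes the same condition as a `query` string. The DataFrame `demo` is invented for the example.

```python
import pandas as pd

# Hypothetical data standing in for the Titanic columns
demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "sex": ["female", "male", "male", "female", "female"],
    "age": [29, 35, 41, 8, 15],
})

# A boolean mask is a Series of True/False values, one per row
mask = (demo["age"] < 18) & (demo["sex"] == "female")
with_mask = demo.loc[mask]

# The same filter as a string expression
with_query = demo.query("age < 18 and sex == 'female'")

print(with_mask)
```

Each condition needs its own parentheses when combined with `&` or `|`, because those operators bind more tightly than comparisons in Python.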

In [None]:
# Let's find the underage people in the data set
df.loc[df['age']<18].head()

In [None]:
# Now let's find underage people who are also female
df.loc[(df['age'] < 18) & (df['sex'] == 'female')].head()

**2.6 Cleaning Data**

2.6.1 Dropping Missing Values

In [None]:
# We will copy the dataframe first in order to not change the original
dft = df.copy()
dft.head()

In [None]:
dft.isnull().sum()

In [None]:
# Dropping every row that contains at least one missing value
dft = dft.dropna()
dft.head()

In [None]:
dft = df.copy()
# axis=1 drops every column that contains at least one missing value
dft = dft.dropna(axis=1)
dft.head()
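The difference between dropping rows and dropping columns is easier to see on a tiny frame. The sketch below uses invented data (`demo`) and also shows the `subset` parameter, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns
demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [29.0, np.nan, 41.0, 8.0],
    "cabin": ["B5", None, None, "C22"],
})

rows_dropped = demo.dropna()            # drop rows with any missing value
cols_dropped = demo.dropna(axis=1)      # drop columns with any missing value
age_only = demo.dropna(subset=["age"])  # drop rows only if age is missing

print(len(rows_dropped), list(cols_dropped.columns), len(age_only))
```

Dropping columns can discard a lot of data, so `subset` is often a gentler option when only one attribute matters for the analysis.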

2.6.2 Replacing Missing Values

In [None]:
dft = df.copy()
dft.loc[dft['age'].isnull()].head()

In [None]:
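# Get the mean of age (missing values are skipped)
mean_value = dft['age'].mean()
# Fill missing ages only; dft.fillna(mean_value) would fill every column
df5t = dft.copy()
df5t['age'] = df5t['age'].fillna(mean_value)
# dft itself is unchanged, so it still shows the missing ages
dft.loc[dft['age'].isnull()].head()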

In [None]:
# The same rows in df5t now hold the mean age
df5t.loc[dft['age'].isnull()].head()
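Calling `fillna` on a whole DataFrame with a single value fills the gaps in every column, not just in `age`. A minimal sketch of filling one column only, on invented data (`demo`, `filled`):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing age and a missing cabin
demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "cabin": ["B5", None, "C22"],
})

# mean() skips missing values by default
mean_age = demo["age"].mean()

# Fill the age column only, on a copy
filled = demo.copy()
filled["age"] = filled["age"].fillna(mean_age)

print(filled)
```

The missing `cabin` entry stays missing, which is usually what we want: an average age is not a meaningful cabin label.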

2.6.3 Removing Duplicates

In [None]:
dft = df.copy()
dft.head()

In [None]:
dft = dft.drop_duplicates()
dft.head()
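Before removing duplicates it can help to count them. The sketch below, on an invented frame `demo`, uses `duplicated()` for the count and shows that `drop_duplicates` can also compare a subset of columns.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "age": [29, 35, 29, 30],
})

# duplicated() marks rows that repeat an earlier row
n_duplicates = demo.duplicated().sum()

# Remove fully identical rows (the first occurrence is kept)
deduped = demo.drop_duplicates()

# Compare on the name column only
deduped_by_name = demo.drop_duplicates(subset=["name"])

print(n_duplicates, len(deduped), len(deduped_by_name))
```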

**2.7 Manipulating Columns**

In [None]:
# We can rename a column using the rename() method
dft = dft.rename(columns = {'home.dest':'destination'})
dft.head()

In [None]:
# Now we will add a new column named baby
# baby is a boolean value column (1 or 0)
# It shows whether a passenger was a baby (age 10 or below) or not
# Passengers with a missing age get 0, because NaN <= 10 is False
dft['baby'] = (dft['age'] <= 10).astype(int)
dft.head()

**2.8 Further Analysis**

In [None]:
# We can compute the mean; median() and mode() are called the same way
df['age'].mean()
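For completeness, here is a small sketch of the three statistics on an invented Series of ages. Unlike `mean()` and `median()`, `mode()` returns a Series, because several values can tie for most frequent.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([22.0, 30.0, 30.0, 38.0, np.nan], name="age")

mean_age = ages.mean()      # missing values are skipped
median_age = ages.median()  # middle value of the sorted ages
mode_age = ages.mode()      # most frequent value(s), as a Series

print(mean_age, median_age, mode_age.tolist())
```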

In [None]:
# Find mean of age of females and males individually
df.groupby('sex')['age'].mean()
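A grouped column can also be summarised with several statistics at once through `agg`. The sketch below assumes an invented frame `demo`; the column names mirror the Titanic data but the values do not.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
demo = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "age": [20.0, 30.0, 50.0, 40.0, np.nan],
})

# One row per group, one column per statistic
summary = demo.groupby("sex")["age"].agg(["mean", "median", "count"])

print(summary)
```

Here `count` reports the number of non-missing ages per group, so it is a quick check on how many values each mean is based on.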