## Pandas Tutorial

Welcome to this Pandas tutorial!

In this notebook, we will explore the basics of Pandas, a widely used Python library for data analysis and manipulation. We will load a dataset, inspect its structure, select rows and columns, and then filter, transform, and summarize the data.

In [None]:
import pandas as pd

**Mounting Google Drive in Colab**

When using Google Colab, we often need to access files from our Google Drive. The following code allows us to mount our Google Drive into the Colab environment.

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

In [None]:
import os
root_dir = "/content/gdrive/MyDrive/lecture/[2024-2] 머신러닝 원리와 응용/lab2_pandas"

# Checking if our specified directory exists
os.path.exists(root_dir)

**Loading the Dataset**

Before we start any data analysis, we first need to load our dataset. In this session, we'll be working with the `titanic_train.csv` dataset.

In [None]:
# Specify the file name
file_name = "titanic_train.csv"

# Load the dataset using pandas
df = pd.read_csv(os.path.join(root_dir, file_name))
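If Google Drive is not available, the same loading step can be tried on an in-memory file. The sketch below is only an illustration: the three-row CSV text and the name `sample_df` are made up for this example and are not part of the Titanic data.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for a real file on disk
csv_text = """PassengerId,Name,Age
1,Alice,29
2,Bob,
3,Carol,41
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the empty Age field becomes NaN, so Age is a float column
```

An empty field in the CSV is read as a missing value (NaN), which is why the `Age` column ends up as floating point rather than integer.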

**Quick Data Exploration**

Now that we've loaded our dataset, let's take a quick peek at its contents.

In [None]:
# Display the dataset (for long dataframes, pandas shows only the first and last rows)
df

In [None]:
# Display the first 5 rows of the dataset
df.head(n=5)

In [None]:
# Display the last 5 rows of the dataset
df.tail(n=5)

**Gathering Basic Information**

Understanding the structure and types of data in our dataset is crucial before diving deep into analysis.

In [None]:
# Get a concise summary of the dataset
df.info()

In [None]:
# Get statistical summary for numerical columns
df.describe()

In [None]:
# Check the shape of the dataset
df.shape

In [None]:
# Display the column names
print(df.columns)
# Display the row names (index)
print(df.index)

**Accessing Data Columns**

You can access columns in a dataframe in multiple ways. Let's explore them.

In [None]:
# Accessing the 'Age' column using dot notation
print(df.Age)
# Accessing the 'Age' column using bracket notation
print(df["Age"])

In [None]:
# Checking the type of the 'Age' column
type(df.Age)

In [None]:
# Specifying a list of column names to view
# (named `cols` rather than `vars` to avoid shadowing Python's built-in vars())
cols = ["Name", "Age"]
df[cols]
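The two notations are not always interchangeable. The following sketch, using a made-up dataframe `toy` with hypothetical column names, shows two cases where only bracket notation works, and how a single name differs from a list of names.

```python
import pandas as pd

# Hypothetical columns chosen to illustrate the edge cases
toy = pd.DataFrame({"Age": [22, 38], "Passenger Class": [3, 1], "count": [1, 2]})

# For a simple column name, both notations return the same Series
same = toy.Age.equals(toy["Age"])

# A name containing a space cannot be written with dot notation
pclass = toy["Passenger Class"]

# A name that clashes with a DataFrame method: toy.count is the method,
# so the column is only reachable through brackets
is_method = callable(toy.count)
count_col = toy["count"]

# A single name gives a Series; a list of names gives a DataFrame
single = toy["Age"]
frame = toy[["Age"]]
print(type(single).__name__, type(frame).__name__)
```

Bracket notation is the safer default; dot notation is a convenience for interactive work with simple column names.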

**Accessing Data Rows**

In [None]:
# Label-based indexing: the row whose index label is 0
df.loc[0]

In [None]:
# Position-based (integer) indexing
# Accessing the first row of the dataframe, whatever its index label is
df.iloc[0]

In [None]:
# Accessing the first three rows of the dataframe
df.iloc[0:3]

In [None]:
# Sorting the dataframe based on the 'Age' column in ascending order
sorted_df = df.sort_values("Age", inplace=False, ascending=True)
sorted_df

In [None]:
# Accessing the row whose index label is 0 (the first row of the original dataframe) using 'loc'
sorted_df.loc[0]

In [None]:
# Accessing the data of the first row in the sorted dataframe using 'iloc'
sorted_df.iloc[0]
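The difference between `loc` and `iloc` after sorting is easier to see on a small example. This is a minimal sketch with a made-up three-row dataframe `toy`; the names and ages are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Ann", "Ben", "Cy"], "Age": [30, 5, 18]})

# Sorting reorders the rows but keeps their original index labels
toy_sorted = toy.sort_values("Age")

by_label = toy_sorted.loc[0]      # row with index label 0 (Ann)
by_position = toy_sorted.iloc[0]  # first row in sorted order (Ben)
print(by_label["Name"], by_position["Name"])

# Slicing also differs: loc includes the end label, iloc excludes the end position
print(len(toy.loc[0:1]), len(toy.iloc[0:1]))
```

On a freshly loaded dataframe with the default index, labels and positions coincide, which is why `df.loc[0]` and `df.iloc[0]` returned the same row before sorting.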

**Filtering**

In [None]:
# Filter the DataFrame 'df' to only include rows where the 'Age' column value is greater than 30.
df[df['Age'] > 30]
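Filtering works by passing a boolean Series inside the brackets, and several conditions can be combined. The sketch below uses a made-up dataframe `toy` whose column names mirror the Titanic data; the values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# The comparison yields one True/False per row; a missing age compares as False
older = toy["Age"] > 30

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses because of operator precedence.
women_over_30 = toy[older & (toy["Sex"] == "female")]
over_30_or_male = toy[older | (toy["Sex"] == "male")]
not_over_30 = toy[~older]

print(women_over_30)
```

Python's `and` / `or` keywords do not work element-wise on Series, so the `&` and `|` operators are used instead.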

**Applying functions**

In [None]:
# Define a function to categorize individuals as 'Adult' if their age is above 18, and 'Child' otherwise.
# Note: a missing age (NaN) fails the `age > 18` comparison, so it is also labeled 'Child'.
def age_category(age):
    return 'Adult' if age > 18 else 'Child'

# Apply the age_category function to the 'Age' column of the DataFrame 'df' and store the results in a new column named 'Category'.
df['Category'] = df['Age'].apply(age_category)
df
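Because missing ages fail the `age > 18` comparison, they are silently labeled 'Child' by a function like the one above. One possible variant, sketched here under the assumption that an explicit 'Unknown' label is wanted, checks for missing values first. The names `age_category_safe`, `via_apply`, and `via_mask` are hypothetical.

```python
import pandas as pd

ages = pd.Series([22.0, 4.0, float("nan"), 18.0])

def age_category_safe(age):
    """Like age_category, but gives missing ages their own label."""
    if pd.isna(age):
        return "Unknown"
    return "Adult" if age > 18 else "Child"

via_apply = ages.apply(age_category_safe)

# The same result without a Python-level function call per row:
# start from 'Child' everywhere, then overwrite where each condition holds
via_mask = (
    pd.Series("Child", index=ages.index)
    .mask(ages > 18, "Adult")
    .mask(ages.isna(), "Unknown")
)

print(via_apply.tolist())
```

`apply` is the more readable option; the `mask` version tends to be faster on large columns because the comparisons run on the whole column at once.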

**Grouping and Aggregation**

In [None]:
# Group the DataFrame 'df' by the 'Sex' column.
grouped = df.groupby('Sex')

# Calculate the mean of the 'Age' column within each group.
grouped["Age"].mean()
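A group-by can also compute several statistics at once. The sketch below uses named aggregation on a made-up dataframe `toy`; the output column names (`mean_age`, `known_ages`, `max_fare`, `passengers`) are chosen for this example only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [20.0, 30.0, 40.0, None],
    "Fare": [10.0, 50.0, 70.0, 30.0],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy.groupby("Sex").agg(
    mean_age=("Age", "mean"),      # missing ages are skipped
    known_ages=("Age", "count"),   # number of non-missing ages
    max_fare=("Fare", "max"),
    passengers=("Fare", "size"),   # number of rows, missing or not
)

print(summary)
```

The result is a dataframe indexed by the group key, so individual values can be read with `summary.loc["female", "mean_age"]`.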

**String operations**

In [None]:
# Convert all characters in the 'Name' column to uppercase.
df['Name'].str.upper()

In [None]:
# Check if each entry in the 'Name' column contains the substring 'Smith'.
df['Name'].str.contains('Smith')
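By default `str.contains` is case-sensitive and propagates missing values. The following sketch, on a made-up Series `names`, shows the `case` and `na` parameters that adjust this behavior.

```python
import pandas as pd

names = pd.Series(["Smith, Mr. John", "smithson, Mrs. A", None, "Doe, Miss. J"])

# Default: case-sensitive, and a missing name stays missing in the result
strict = names.str.contains("Smith")

# Ignore case and treat missing names as "no match"
loose = names.str.contains("smith", case=False, na=False)

print(loose.tolist())
```

Passing `na=False` yields a clean boolean Series, which is what is needed when the result is used as a filter.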

**Data Distribution**

To examine the distribution of a categorical variable as a frequency table, use `value_counts()`. The example below counts how many names do and do not contain 'Smith'.

In [None]:
df['Name'].str.contains('Smith').value_counts()
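`value_counts()` can also report proportions and include missing values. A minimal sketch on a made-up Series `embarked` (the values are illustrative):

```python
import pandas as pd

embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

counts = embarked.value_counts()                    # missing values are excluded by default
shares = embarked.value_counts(normalize=True)      # proportions instead of counts
with_missing = embarked.value_counts(dropna=False)  # count missing values as well

print(counts)
print(shares)
```

The results are sorted from the most to the least frequent value.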

Visualizations provide a quick overview of the data distribution.

In [None]:
# Plotting a histogram for the 'Age' column
df.Age.hist(bins=20)

## Misc

Creating DataFrames:

In [None]:
# Creating a simple dataframe with various data types.
# Note: this reassigns `df`, so the Titanic data is no longer available under that name.
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': ['a', 'b', 'c']
}

df = pd.DataFrame(data)
df

Setting and Resetting Index:

In [None]:
# Setting 'A' column as the index
df.set_index('A', inplace=True)
df

In [None]:
# Resetting the index back to default
df.reset_index(inplace=True)
df
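The same steps can be written without `inplace=True`, in which case each call returns a new dataframe and the original is left untouched. A small sketch, with hypothetical names `toy`, `indexed`, and `restored`:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# set_index returns a new DataFrame; toy itself still has column 'A'
indexed = toy.set_index("A")

# With 'A' as the index, loc looks rows up by the values of 'A'
row = indexed.loc[2]
print(row["B"])

# reset_index moves 'A' back into a regular column
restored = indexed.reset_index()
```

Returning a new object makes it easier to chain operations and to keep the original data around for comparison.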

Concatenating, Joining, and Merging DataFrames:

In [None]:
df1 = pd.DataFrame({
    'Class1': [95, 92, 98, 100],
    'Class2': [91, 93, 96, 99]
})

# Create the second DataFrame to append.
# 'Class2' is left out on purpose, so the appended rows get NaN in that column.
df2 = pd.DataFrame({
    'Class1': [87, 89],
    # 'Class2': [85, 90]
})

# `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat` instead.
# df_appended = df1.append(df2, ignore_index=True)
df_appended = pd.concat([df1, df2], ignore_index=True)
df_appended
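When the dataframes being concatenated do not share all columns, the gaps are filled with NaN. The sketch below reproduces this on two small made-up dataframes `a` and `b`.

```python
import pandas as pd

a = pd.DataFrame({"Class1": [95, 92], "Class2": [91, 93]})
b = pd.DataFrame({"Class1": [87]})  # no 'Class2' column

# Rows are stacked; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([a, b], ignore_index=True)

print(stacked)
print(stacked.dtypes)  # 'Class2' becomes float because it now holds NaN
```

The integer column `Class2` is converted to floating point, since NaN cannot be stored in a plain integer column.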

In [None]:
df1 = pd.DataFrame({
    'Class1': [95, 92, 98, 100],
    'Class2': [91, 93, 96, 99]
})

# Create the second DataFrame to join (it has only two rows)
df2 = pd.DataFrame({
    'Class3': [87, 89],
})

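# `join` aligns rows on the index and defaults to a left join,
# so rows 2 and 3 of df1 get NaN in 'Class3'
df_joined = df1.join(df2)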
df_joined

In [None]:
# Employees DataFrame
df_employees = pd.DataFrame({
    'FirstName': ['John', 'Jane', 'Alice', 'Bob'],
    'LastName': ['Doe', 'Smith', 'Johnson', 'Lee'],
    'Department': ['HR', 'Finance', 'IT', 'Marketing']
})

# Salaries DataFrame
df_salaries = pd.DataFrame({
    'FirstName': ['John', 'Jane', 'Alice', 'Bob', 'John'],
    'LastName': ['Doe', 'Smith', 'Johnson', 'Lee', 'Williams'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
})

print("Employees DataFrame:")
print(df_employees)
print("\nSalaries DataFrame:")
print(df_salaries)

# Merge on multiple columns
df_merged = pd.merge(
    df_employees,
    df_salaries,
    on=['FirstName', 'LastName'],
    how='inner'  # You can choose 'left', 'right', 'outer' as needed
)

print("\nMerged DataFrame (Inner Join):")
print(df_merged)
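An inner join silently drops rows that have no match on the other side. One way to see what is dropped is an outer join with `indicator=True`, sketched here on two shortened, made-up versions of the dataframes above.

```python
import pandas as pd

employees = pd.DataFrame({
    "FirstName": ["John", "Jane"],
    "LastName": ["Doe", "Smith"],
    "Department": ["HR", "Finance"],
})
salaries = pd.DataFrame({
    "FirstName": ["John", "John"],
    "LastName": ["Doe", "Williams"],
    "Salary": [50000, 65000],
})

# how='outer' keeps unmatched rows from both sides;
# indicator=True adds a '_merge' column saying where each row came from
outer = pd.merge(
    employees,
    salaries,
    on=["FirstName", "LastName"],
    how="outer",
    indicator=True,
)

print(outer)
```

Rows marked `left_only` or `right_only` in the `_merge` column are exactly the ones an inner join would discard.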

Handling Missing Data:

In [None]:
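# Two common strategies for missing values; in practice, pick one.
# Strategy 1: remove rows that contain any NaN value
df.dropna(inplace=True)
# Strategy 2: replace NaN values with 0
# (a no-op here, because dropna above has already removed every NaN)
df.fillna(value=0, inplace=True)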

df
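Before choosing between dropping and filling, it helps to count the missing values per column. The sketch below uses a made-up dataframe `toy`; filling `Age` with the column mean is only one possible choice, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0],
    "Cabin": ["C85", None, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Without inplace=True, both methods return a new DataFrame
dropped = toy.dropna()                               # keeps only complete rows
age_filled = toy.fillna({"Age": toy["Age"].mean()})  # fills 'Age' only

print(age_filled)
```

Passing a dictionary to `fillna` restricts the replacement to the listed columns, so `Cabin` keeps its missing values here.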

Saving:

In [None]:
# Saving the modified dataframe to a CSV file
# (the index is written as an unnamed first column unless index=False is passed)
df.to_csv(os.path.join(root_dir, 'example.csv'))
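The effect of the `index` parameter can be checked without writing to disk, because `to_csv` returns the CSV text when no path is given. A minimal sketch with a made-up dataframe `toy`:

```python
import io

import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "C": ["a", "b"]})

# With no path argument, to_csv returns the CSV content as a string
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header contains only the real columns

# Reading the index-free version back gives the original columns
reloaded = pd.read_csv(io.StringIO(without_index))
```

With the default settings, reloading the file produces an extra `Unnamed: 0` column, which is why `index=False` is often used when the index carries no information.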