# Pandas Cheat Sheet for Hackathons

This notebook is a quick reference to the Pandas features most often needed for data manipulation and analysis during a hackathon. It covers creating data structures, reading and writing files, inspection, selection, handling missing data, grouping, and combining data.

## 1. Importing Pandas

In [1]:
# Import the pandas library with its conventional alias 'pd'
import pandas as pd
# Import numpy for sample data creation
import numpy as np

## 2. Core Data Structures: Series and DataFrame

Pandas is built around two primary data structures: the `Series` (1D) and the `DataFrame` (2D).

In [2]:
# Create a Series from a Python list
# A Series is a one-dimensional labeled array
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(f"Pandas Series:\n{s}")

# Create a DataFrame from a dictionary of lists
# A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(f"Pandas DataFrame:\n{df}")

Pandas Series:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Pandas DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
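
Two other construction patterns come up often: a `Series` with explicit labels, and a `DataFrame` built from a list of records. The sketch below is illustrative only; the names `scores`, `records`, and `df_records` are made up for this example.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
scores = pd.Series([90, 85, 77], index=['math', 'physics', 'art'], name='score')
print(scores['physics'])  # label-based lookup

# A DataFrame from a list of records (one dict per row).
# Keys missing from a record become NaN in that row.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
]
df_records = pd.DataFrame(records)
print(df_records)
```

The list-of-records form is handy when rows arrive one at a time, for example from a parsed JSON response.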


## 3. Reading and Writing Data

Pandas provides `read_*` functions and matching `to_*` methods for many file formats, including CSV, Excel, JSON, and Parquet. CSV is one of the most common formats.

In [3]:
# Create a sample DataFrame to save
sample_df_to_save = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Write the DataFrame to a CSV file
# index=False prevents pandas from writing the DataFrame index as a column
sample_df_to_save.to_csv('sample.csv', index=False)
print("DataFrame saved to sample.csv")

# Read data from a CSV file into a new DataFrame
df_from_csv = pd.read_csv('sample.csv')
print("DataFrame read from sample.csv:")
print(df_from_csv)

DataFrame saved to sample.csv
DataFrame read from sample.csv:
   col1  col2
0     1     3
1     2     4
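
Hackathon CSV files rarely match the defaults, so a few `read_csv` parameters are worth knowing. The following is a minimal sketch using made-up CSV text; `io.StringIO` stands in for a file on disk so the example runs without creating any files.

```python
import io
import pandas as pd

# Hypothetical CSV content with a ';' delimiter and one missing score ("NA")
csv_text = "id;name;score\n1;Ann;9.5\n2;Ben;NA\n3;Cy;7.0\n"

df_opts = pd.read_csv(
    io.StringIO(csv_text),    # a path such as 'data.csv' works the same way
    sep=';',                  # non-default delimiter
    usecols=['id', 'score'],  # load only the columns you need
    dtype={'id': 'int32'},    # set a column dtype explicitly
    nrows=2,                  # read only the first 2 data rows
)
print(df_opts)
```

Reading a small `nrows` sample first is a cheap way to check the delimiter and column names before loading a large file.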


## 4. Inspecting Data

Quickly get a summary of your DataFrame's content.

In [4]:
# Use the DataFrame created earlier
print("Original DataFrame:")
print(df)

# Display the first 2 rows (head() returns 5 rows when called without an argument)
print("--- First 2 rows (head) ---")
print(df.head(2))

# Display the last 2 rows (tail() returns 5 rows when called without an argument)
print("--- Last 2 rows (tail) ---")
print(df.tail(2))

# Get a concise summary of the DataFrame
# Includes index dtype, column dtypes, non-null values, and memory usage.
print("--- DataFrame Info ---")
df.info()

# Get descriptive statistics for numerical columns
print("--- Descriptive Statistics ---")
print(df.describe())

Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
--- First 2 rows (head) ---
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
--- Last 2 rows (tail) ---
      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston
--- DataFrame Info ---
<class 'pandas.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    4 non-null      str  
 1   Age     4 non-null      int64
 2   City    4 non-null      str  
dtypes: int64(1), str(2)
memory usage: 228.0 bytes
--- Descriptive Statistics ---
             Age
count   4.000000
mean   32.500000
std     6.454972
min    25.000000
25%    28.750000
50%    32.500000
75%    36.250000
max    40.000000
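
A few more one-liners round out a first look at a dataset. This sketch uses a small made-up frame, `df_demo`, with one repeated city so the counts are not trivial.

```python
import pandas as pd

# Hypothetical sample data with a repeated city
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Chicago', 'Chicago', 'Houston'],
})

# (rows, columns) as a tuple
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# Data type of each column
print(df_demo.dtypes)

# Number of missing values per column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Frequency of each distinct value, most common first
city_counts = df_demo['City'].value_counts()
print(city_counts)

# Number of distinct values in a column
n_unique_cities = df_demo['City'].nunique()
print(n_unique_cities)
```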


## 5. Data Selection and Indexing

Pandas offers multiple ways to select and index data, including label-based (`.loc`), position-based (`.iloc`), and boolean indexing.

In [5]:
# Select a single column, which returns a Series
print("--- Selecting the 'Age' column ---")
print(df['Age'])

# Select rows by their integer position with iloc
print("--- Selecting the first two rows (iloc) ---")
print(df.iloc[0:2])

# Select rows and columns by label with loc
# Let's set the 'Name' column as the index first
df_indexed = df.set_index('Name')
print(f"DataFrame with 'Name' as index:\n{df_indexed}")
print("--- Selecting data for 'Bob' (loc) ---")
print(df_indexed.loc['Bob'])

# Boolean Indexing: Filter rows based on a condition
print("--- Selecting people older than 30 ---")
print(df[df['Age'] > 30])

--- Selecting the 'Age' column ---
0    25
1    30
2    35
3    40
Name: Age, dtype: int64
--- Selecting the first two rows (iloc) ---
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
DataFrame with 'Name' as index:
         Age         City
Name                     
Alice     25     New York
Bob       30  Los Angeles
Charlie   35      Chicago
David     40      Houston
--- Selecting data for 'Bob' (loc) ---
Age              30
City    Los Angeles
Name: Bob, dtype: object
--- Selecting people older than 30 ---
      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston
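
Selections often need both a row filter and a column subset, or more than one condition. The sketch below rebuilds the same sample data under the assumed name `df_sel` so it runs on its own.

```python
import pandas as pd

df_sel = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# .loc with a boolean row mask and a list of column labels
older = df_sel.loc[df_sel['Age'] > 30, ['Name', 'City']]
print(older)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mid = df_sel[(df_sel['Age'] >= 30) & (df_sel['City'] != 'Houston')]
print(mid)

# isin() keeps rows whose value appears in a list
coastal = df_sel[df_sel['City'].isin(['New York', 'Los Angeles'])]
print(coastal)

# .iloc with (row position, column position) returns a single value
first_age = df_sel.iloc[0, 1]
print(first_age)
```

Python's `and` / `or` keywords do not work element-wise on Series, which is why the bitwise operators are used here.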


## 6. Handling Missing Data

Real-world data is often messy and contains missing values (represented as `NaN`).

In [6]:
# Create a DataFrame with missing values
data_missing = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 20, 30, 40]
}
df_missing = pd.DataFrame(data_missing)
print("DataFrame with Missing Values:")
print(df_missing)

# Drop rows with any missing values
print("--- Dropping rows with any NaN ---")
print(df_missing.dropna())

# Fill missing values in column 'B' with that column's mean; column 'A' is not in the dict, so its NaN stays
print("--- Filling NaN with the mean of column 'B' ---")
mean_b = df_missing['B'].mean()
print(df_missing.fillna(value={'B': mean_b}))


DataFrame with Missing Values:
     A    B   C
0  1.0  5.0  10
1  2.0  NaN  20
2  NaN  NaN  30
3  4.0  8.0  40
--- Dropping rows with any NaN ---
     A    B   C
0  1.0  5.0  10
3  4.0  8.0  40
--- Filling NaN with the mean of column 'B' ---
     A    B   C
0  1.0  5.0  10
1  2.0  6.5  20
2  NaN  6.5  30
3  4.0  8.0  40
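
Before choosing between dropping and filling, it helps to count the gaps, and there are a few other common fill strategies. This is a minimal sketch on the same values, using the assumed name `df_na`; which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 20, 30, 40],
})

# Count missing values per column
na_counts = df_na.isna().sum()
print(na_counts)

# Fill every numeric column with its own mean in one call
filled_all = df_na.fillna(df_na.mean())
print(filled_all)

# Drop rows only when a specific column is missing
kept = df_na.dropna(subset=['A'])
print(kept)

# Forward fill: carry the last valid value down each column
forward = df_na.ffill()
print(forward)
```

Forward fill usually makes sense only for ordered data such as time series, where the previous row is a reasonable stand-in.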


## 7. Grouping and Aggregation

The `groupby` operation follows a split-apply-combine pattern: it splits the data into groups, applies a function to each group, and combines the results.

In [7]:
# Create a sample DataFrame for grouping
data_group = {'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
              'Player': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
              'Points': [10, 12, 10, 12, 14, 15],
              'Rebounds': [5, 7, 7, 8, 9, 11]}
df_group = pd.DataFrame(data_group)
print("Sample DataFrame for Grouping:")
print(df_group)

# Group by the 'Team' column and calculate the sum of points for each team
print("--- Total points per team ---")
print(df_group.groupby('Team')['Points'].sum())

# Group by 'Team' and apply multiple aggregation functions
print("--- Aggregated stats per team ---")
print(df_group.groupby('Team').agg({'Points': ['mean', 'max'], 'Rebounds': 'sum'}))

Sample DataFrame for Grouping:
  Team Player  Points  Rebounds
0    A     P1      10         5
1    B     P2      12         7
2    A     P3      10         7
3    B     P4      12         8
4    A     P5      14         9
5    B     P6      15        11
--- Total points per team ---
Team
A    34
B    39
Name: Points, dtype: int64
--- Aggregated stats per team ---
         Points     Rebounds
           mean max      sum
Team                        
A     11.333333  14       21
B     13.000000  15       26
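
The dictionary form of `agg` produces two-level column headers, which can be awkward to work with afterwards. One alternative is named aggregation, sketched below on the same values; the output column names (`avg_points` and so on) and the `TeamAvgPoints` column are chosen for this example only.

```python
import pandas as pd

df_g = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 12, 10, 12, 14, 15],
    'Rebounds': [5, 7, 7, 8, 9, 11],
})

# Named aggregation: new_column=(source_column, function) gives flat column names.
# reset_index() turns the 'Team' index back into a regular column.
summary = (
    df_g.groupby('Team')
        .agg(avg_points=('Points', 'mean'),
             max_points=('Points', 'max'),
             total_rebounds=('Rebounds', 'sum'))
        .reset_index()
)
print(summary)

# transform() returns a result aligned to the original rows,
# which is useful for adding a group-level feature to each row
df_g['TeamAvgPoints'] = df_g.groupby('Team')['Points'].transform('mean')
print(df_g)
```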


## 8. Combining DataFrames: Merge, Join, Concat

`pd.merge` combines DataFrames on shared key columns, `DataFrame.join` combines them on their index, and `pd.concat` stacks them along an axis.

In [8]:
# Create two DataFrames to combine
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})
print(f"Left DF:\n{left}")
print(f"Right DF:\n{right}")

# Merge: Combine based on a common column (like a SQL join)
print("--- Inner Merge on 'key' ---")
print(pd.merge(left, right, on='key', how='inner'))

# Concat: Stack DataFrames vertically (the default, axis=0); pass axis=1 to place them side by side
df_concat = pd.concat([left, right])
print("--- Concatenating (stacking) ---")
print(df_concat)

Left DF:
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
Right DF:
  key   B
0  K0  B0
1  K1  B1
2  K3  B3
--- Inner Merge on 'key' ---
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
--- Concatenating (stacking) ---
  key    A    B
0  K0   A0  NaN
1  K1   A1  NaN
2  K2   A2  NaN
0  K0  NaN   B0
1  K1  NaN   B1
2  K3  NaN   B3
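
The inner merge above drops keys that appear on only one side, and the concatenated result repeats the index labels 0, 1, 2. The sketch below shows some common variations on the same two frames, including the index-based `join` named in the section title; the result names (`left_merged`, `outer`, `stacked`, `joined`) are chosen for this example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# Left merge: keep every row of `left`; unmatched keys get NaN in 'B'
left_merged = pd.merge(left, right, on='key', how='left')
print(left_merged)

# Outer merge: keep keys from both sides; indicator=True adds a '_merge'
# column showing where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)

# Concat with ignore_index=True to get a fresh 0..n-1 index
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# Join: combine on the index (a left join by default)
joined = left.set_index('key').join(right.set_index('key'))
print(joined)
```

Checking the `_merge` column after an outer merge is a quick way to spot keys that failed to match.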
