# Pandas - Part 1: Series and DataFrames

This notebook covers creating and understanding Pandas Series and DataFrames.

**Topics covered:**
- Creating Series from lists and dicts
- Creating DataFrames
- Basic properties (shape, dtypes, index, columns)
- Reading from CSV files
- Basic info and describe methods

**Problems:** 15 (Easy: 1-5, Medium: 6-10, Hard: 11-15)

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '..')
from utils.checker import check
from utils.checks import pandas_01_series_dataframes as verify

print(f"Pandas version: {pd.__version__}")
print("Setup complete!")

---
## Problem 1: Create Series from List
**Difficulty:** Easy

### Concept
A Pandas Series is a one-dimensional labeled array that can hold any data type. It's similar to a Python list or NumPy array, but every value carries an index label, which enables label-based access, automatic alignment between Series, and built-in vectorized operations.

### Syntax
```python
pd.Series(data, index=None)
# data: list, array, or dict
# index: optional labels for the data
```

### Example
```python
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s
0    1
1    2
2    3
3    4
4    5
dtype: int64
```
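
The topic list also mentions dicts. As a minimal sketch (the `prices` data is illustrative and not part of the notebook), passing a dict to `pd.Series` uses the keys as index labels and the values as data:

```python
import pandas as pd

# Illustrative data (not from the notebook): dict keys become index labels.
prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 4.0})

print(prices.index.tolist())  # ['apple', 'banana', 'cherry']
print(prices["banana"])       # 0.25
print(len(prices))            # 3
```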

### Task
Create a Pandas Series from the list `[10, 20, 30, 40, 50]`. Store it in a variable called `series`.

### Expected Properties
- `series` should be a pandas Series
- Should have length of 5
- First element should be 10

In [None]:
# Your solution:
series = None

In [None]:
# Verification
verify.p1(series)

---
## Problem 2: Create Series with Custom Index
**Difficulty:** Easy

### Concept
Series can have custom labels (index) instead of the default 0, 1, 2, etc. This makes data access more meaningful and enables alignment operations between Series.

### Syntax
```python
pd.Series(data, index=['label1', 'label2', 'label3'])
```

### Example
```python
>>> s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])
>>> s['y']
200
```
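
To make "alignment" concrete, here is a small sketch with two made-up Series (`q1` and `q2` are illustrative names). When they are added, pandas matches values by label rather than by position:

```python
import pandas as pd

# Same labels, but stored in a different order.
q1 = pd.Series([10, 20, 30], index=["x", "y", "z"])
q2 = pd.Series([1, 2, 3], index=["z", "y", "x"])

# Addition pairs values that share a label, regardless of position.
total = q1 + q2

print(total["x"])  # 13  (10 + 3)
print(total["y"])  # 22  (20 + 2)
print(total["z"])  # 31  (30 + 1)
```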

### Task
Create a Series with values `[100, 200, 300]` and index `['a', 'b', 'c']`. Store in `indexed_series`.

### Expected Properties
- Should be a pandas Series
- Index should be ['a', 'b', 'c']
- Accessing with index 'a' should return 100

In [None]:
# Your solution:
indexed_series = None

In [None]:
# Verification
verify.p2(indexed_series)

---
## Problem 3: Create DataFrame from Dictionary
**Difficulty:** Easy

### Concept
A DataFrame is a 2-dimensional labeled data structure, like a table with rows and columns. The most common way to create a DataFrame is from a dictionary where keys become column names and values are lists of data.

### Syntax
```python
pd.DataFrame({'col1': [val1, val2], 'col2': [val3, val4]})
```

### Example
```python
>>> df = pd.DataFrame({
...     'name': ['John', 'Jane'],
...     'age': [30, 25]
... })
>>> df
   name  age
0  John   30
1  Jane   25
```
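
One detail worth seeing once: the lists in the dictionary must all have the same length. The sketch below uses illustrative data (`inventory` is not part of the notebook) and shows both the normal case and the error raised for mismatched lengths:

```python
import pandas as pd

inventory = pd.DataFrame({
    "item": ["pen", "notebook"],
    "qty": [12, 3],
})
print(list(inventory.columns))  # ['item', 'qty']
print(inventory.shape)          # (2, 2)

# Column lists of different lengths cannot form a rectangular table.
try:
    pd.DataFrame({"item": ["pen", "notebook"], "qty": [12]})
    length_error = None
except ValueError as err:
    length_error = err

print(type(length_error).__name__)  # ValueError
```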

### Task
Create a DataFrame from the following dictionary:
```python
{'name': ['Alice', 'Bob', 'Charlie'],
 'age': [25, 30, 35],
 'city': ['NYC', 'LA', 'Chicago']}
```
Store in `df`.

### Expected Properties
- Should be a pandas DataFrame
- Should have shape (3, 3) - 3 rows, 3 columns
- Should have columns named 'name', 'age', 'city'

In [None]:
# Your solution:
df = None

In [None]:
# Verification
verify.p3(df)

---
## Problem 4: Get DataFrame Shape
**Difficulty:** Easy

### Concept
The `shape` attribute returns a tuple representing the dimensions of a DataFrame: (number of rows, number of columns). This is essential for understanding your data's size.

### Syntax
```python
df.shape  # Returns (rows, columns)
```

### Example
```python
>>> df.shape
(100, 5)  # 100 rows, 5 columns
```
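
Because `shape` is an ordinary tuple, it can be unpacked. The following sketch uses a made-up DataFrame (`grid`) and compares `shape` with `len()` and `size`:

```python
import pandas as pd

grid = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

n_rows, n_cols = grid.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)  # 4 2
print(len(grid))       # 4  (len counts rows only)
print(grid.size)       # 8  (rows * columns)
```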

### Task
Get the shape of the DataFrame you created in Problem 3. Store in `df_shape`.

### Expected Properties
- Should be a tuple
- Should equal (3, 3)

In [None]:
# Your solution:
df_shape = None

In [None]:
# Verification
verify.p4(df_shape)

---
## Problem 5: Get Column Names
**Difficulty:** Easy

### Concept
The `columns` attribute returns an Index object containing all column names. You can convert it to a list for easier manipulation.

### Syntax
```python
df.columns        # Returns Index object
list(df.columns)  # Convert to list
```

### Example
```python
>>> list(df.columns)
['name', 'age', 'salary']
```
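
The Index object is already useful before conversion. A short sketch, assuming an illustrative one-row DataFrame named `staff`:

```python
import pandas as pd

staff = pd.DataFrame({"name": ["Ann"], "age": [41], "salary": [52000.0]})
cols = staff.columns

print(type(cols).__name__)  # Index
print("age" in cols)        # True (membership test on column names)
print(cols[0])              # name (positional access)
print(cols.tolist())        # ['name', 'age', 'salary']
```

`cols.tolist()` and `list(cols)` produce the same list here.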

### Task
Get the column names of `df` as a list. Store in `columns`.

### Expected Properties
- Should be a list
- Should equal ['name', 'age', 'city']

In [None]:
# Your solution:
columns = None

In [None]:
# Verification
verify.p5(columns)

---
## Problem 6: Read CSV File
**Difficulty:** Medium

### Concept
Reading data from CSV files is one of the most common operations in data analysis. Pandas provides `pd.read_csv()` which automatically infers data types and creates a DataFrame.

### Syntax
```python
df = pd.read_csv('filepath.csv')
# Optional parameters: sep, header, names, index_col, etc.
```

### Example
```python
>>> df = pd.read_csv('data.csv')
>>> df.head()
```
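
The optional parameters can be tried without any file on disk. This sketch feeds an in-memory string to `pd.read_csv` through `io.StringIO`; the semicolon-separated text and the name `scores` are made up for illustration:

```python
import io

import pandas as pd

# Illustrative CSV content that uses ';' as the separator.
csv_text = "id;name;score\n1;Ann;9.5\n2;Ben;7.0\n"

# sep sets the delimiter; index_col turns the 'id' column into the row index.
scores = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="id")

print(scores.shape)            # (2, 2)
print(scores.loc[2, "name"])   # Ben
print(scores["score"].dtype)   # float64
```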

### Task
Read the Titanic dataset from `../datasets/public/titanic.csv`. Store in `titanic`.

### Expected Properties
- Should be a pandas DataFrame
- Should not be empty (has more than 0 rows)
- Should have multiple columns

In [None]:
# Your solution:
titanic = None

In [None]:
# Verification
check.is_type(titanic, pd.DataFrame, "P6: Type check")
check.is_true(len(titanic) > 0, "P6: Not empty", "DataFrame should have rows")
check.is_true(len(titanic.columns) > 0, "P6: Has columns", "DataFrame should have columns")

---
## Problem 7: Get Data Types
**Difficulty:** Medium

### Concept
The `dtypes` attribute shows the data type of each column. Understanding data types is crucial for performing correct operations and optimizing memory usage.

### Syntax
```python
df.dtypes  # Returns Series with column names as index
```

### Example
```python
>>> df.dtypes
name      object
age        int64
salary   float64
dtype: object
```
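
Since `dtypes` is itself a Series, it can be indexed by column name. A minimal sketch with illustrative numeric data (`readings`), which also shows `astype` for converting a column:

```python
import pandas as pd

readings = pd.DataFrame({"count": [1, 2, 3], "ratio": [0.5, 1.5, 2.5]})
kinds = readings.dtypes

print(list(kinds.index))  # ['count', 'ratio']
print(kinds["ratio"])     # float64

# astype returns a converted copy; the original column is left untouched.
as_float = readings["count"].astype("float64")
print(as_float.dtype)     # float64
```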

### Task
Get the data types of all columns in the `titanic` DataFrame. Store in `dtypes`.

### Expected Properties
- Should be a pandas Series
- Length should equal number of columns in titanic

In [None]:
# Your solution:
dtypes = None

In [None]:
# Verification
check.is_type(dtypes, pd.Series, "P7: Type check")
check.is_true(len(dtypes) == len(titanic.columns), "P7: Length matches columns", "Should have dtype for each column")

---
## Problem 8: View First N Rows
**Difficulty:** Medium

### Concept
The `head()` method returns the first n rows of a DataFrame. This is useful for quickly inspecting your data without printing the entire DataFrame.

### Syntax
```python
df.head(n)  # Default n=5
```

### Example
```python
>>> df.head(3)  # First 3 rows
```
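
A small sketch of how `n` behaves, using a made-up eight-row DataFrame (`numbers`). It also shows `tail()`, the counterpart that returns the last rows:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(8)})

print(len(numbers.head()))    # 5  (default n)
print(len(numbers.head(3)))   # 3
print(len(numbers.head(20)))  # 8  (n larger than the frame returns all rows)
print(numbers.tail(2)["n"].tolist())  # [6, 7]
```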

### Task
Get the first 5 rows of `titanic`. Store in `head_5`.

### Expected Properties
- Should be a pandas DataFrame
- Should have exactly 5 rows
- Should have same columns as original titanic DataFrame

In [None]:
# Your solution:
head_5 = None

In [None]:
# Verification
check.is_type(head_5, pd.DataFrame, "P8: Type check")
check.has_length(head_5, 5, "P8: Length")
check.is_true(list(head_5.columns) == list(titanic.columns), "P8: Same columns", "Should have same columns as titanic")

---
## Problem 9: Get Basic Statistics
**Difficulty:** Medium

### Concept
By default, the `describe()` method generates descriptive statistics for numeric columns only, including count, mean, std, min, quartiles, and max. This provides a quick overview of your data's distribution.

### Syntax
```python
df.describe()  # Numeric columns only, by default
```

### Example
```python
>>> df.describe()
             age       salary
count   5.000000      5.00000
mean   28.000000  55000.00000
std     5.701754   9899.49494
min    22.000000  45000.00000
...
```
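
The result of `describe()` is a regular DataFrame, so individual statistics can be looked up with `.loc`. A minimal sketch with illustrative data (`marks`); note that the text column is left out of the summary:

```python
import pandas as pd

marks = pd.DataFrame({
    "student": ["a", "b", "c", "d"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
summary = marks.describe()

print(list(summary.columns))         # ['score']
print(list(summary.index))           # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "score"])  # 2.5
print(summary.loc["max", "score"])   # 4.0
```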

### Task
Get descriptive statistics for numerical columns in `titanic` using `.describe()`. Store in `stats`.

### Expected Properties
- Should be a pandas DataFrame
- Should contain statistical measures (count, mean, std, etc.)
- Index should include 'mean'

In [None]:
# Your solution:
stats = None

In [None]:
# Verification
check.is_type(stats, pd.DataFrame, "P9: Type check")
check.contains(list(stats.index), 'mean', "P9: Contains 'mean'")
check.contains(list(stats.index), 'std', "P9: Contains 'std'")

---
## Problem 10: Access Single Column
**Difficulty:** Medium

### Concept
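You can extract a single column from a DataFrame as a Series. Bracket notation, `df['column']`, always works; attribute access, `df.column`, works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.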

### Syntax
```python
df['column_name']  # Preferred
df.column_name     # Works only for valid identifiers that don't clash with DataFrame attributes
```

### Example
```python
>>> ages = df['age']
>>> type(ages)
<class 'pandas.core.series.Series'>
```
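
The sketch below shows why bracket notation is the safer default. The DataFrame `orders` is made up so that one column name collides with a DataFrame method and another contains a space:

```python
import pandas as pd

orders = pd.DataFrame({"count": [3, 5], "unit price": [2.0, 4.0]})

# Bracket notation always returns the column.
print(orders["count"].tolist())       # [3, 5]
print(orders["unit price"].tolist())  # [2.0, 4.0]

# Attribute access finds the DataFrame.count method, not the column.
print(callable(orders.count))         # True
```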

### Task
Extract the 'age' column from `df` (the one you created in Problem 3) as a Series. Store in `ages`.

### Expected Properties
- Should be a pandas Series
- Should have length 3
- Values should be [25, 30, 35]

In [None]:
# Your solution:
ages = None

In [None]:
# Verification
check.is_type(ages, pd.Series, "P10: Type check")
check.has_length(ages, 3, "P10: Length")
check.is_true(list(ages) == [25, 30, 35], "P10: Correct values", "Values should be [25, 30, 35]")

---
## Problem 11: Create DataFrame with Custom Index
**Difficulty:** Hard

### Concept
DataFrames can have custom row labels (index) instead of default 0, 1, 2, etc. You can specify the index during creation using the `index` parameter.

### Syntax
```python
pd.DataFrame(data, columns=['A', 'B'], index=['row1', 'row2'])
```

### Example
```python
>>> df = pd.DataFrame(
...     [[1, 2], [3, 4]],
...     columns=['X', 'Y'],
...     index=['a', 'b']
... )
>>> df
   X  Y
a  1  2
b  3  4
```
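
Custom row labels pay off when looking values up. A short sketch with illustrative data (`temps`), using `.loc` for label-based access and `.iloc` for position-based access:

```python
import pandas as pd

temps = pd.DataFrame(
    [[20, 65], [25, 50]],
    columns=["temp_c", "humidity"],
    index=["mon", "tue"],
)

print(temps.loc["tue", "humidity"])  # 50  (row label, column label)
print(temps.loc["mon"].tolist())     # [20, 65]  (whole row by label)
print(temps.iloc[1].tolist())        # [25, 50]  (whole row by position)
```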

### Task
Create a DataFrame with:
- Columns: 'A', 'B', 'C'
- Index: 'row1', 'row2', 'row3'
- Values: a 3x3 matrix containing integers 1-9

Store in `custom_df`.

### Expected Properties
- Should have columns ['A', 'B', 'C']
- Should have index ['row1', 'row2', 'row3']
- Should have shape (3, 3)

In [None]:
# Your solution:
custom_df = None

In [None]:
# Verification
check.has_columns(custom_df, ['A', 'B', 'C'], "P11: Columns")
check.has_index(custom_df, ['row1', 'row2', 'row3'], "P11: Index")
check.has_shape(custom_df, (3, 3), "P11: Shape")

---
## Problem 12: Rename Columns
**Difficulty:** Hard

### Concept
The `rename()` method changes column names. Pass a dictionary that maps old names to new names. By default (`inplace=False`) it returns a new DataFrame and leaves the original unchanged.

### Syntax
```python
df.rename(columns={'old_name': 'new_name'})
# or
df.columns = ['new1', 'new2', 'new3']  # Rename all at once (modifies df in place)
```

### Example
```python
>>> df = df.rename(columns={'name': 'Name', 'age': 'Age'})
```
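
A minimal sketch, with made-up data (`raw`), confirming that `rename()` leaves the original untouched. It also shows that `columns` accepts a function instead of a dictionary:

```python
import pandas as pd

raw = pd.DataFrame({"fn": ["Ada", "Alan"], "yr": [1815, 1912]})

# A dict maps old names to new names; a new DataFrame is returned.
clean = raw.rename(columns={"fn": "first_name", "yr": "birth_year"})
print(list(clean.columns))  # ['first_name', 'birth_year']
print(list(raw.columns))    # ['fn', 'yr']  (original unchanged)

# A function is applied to every column name.
upper = raw.rename(columns=str.upper)
print(list(upper.columns))  # ['FN', 'YR']
```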

### Task
Create a copy of `df` and rename the columns to: 'Name', 'Age', 'City' (capitalized). Store in `df_renamed`.

### Expected Properties
- Should have columns ['Name', 'Age', 'City']
- Should have same shape as original df
- Original df should remain unchanged

In [None]:
# Your solution:
df_renamed = None

In [None]:
# Verification
check.has_columns(df_renamed, ['Name', 'Age', 'City'], "P12: Columns")
check.has_shape(df_renamed, (3, 3), "P12: Shape")
check.is_true(list(df.columns) == ['name', 'age', 'city'], "P12: Original unchanged", "Original df should not be modified")

---
## Problem 13: Set Index
**Difficulty:** Hard

### Concept
The `set_index()` method converts a column into the DataFrame's index. This is useful when you want to use a column's values as row labels for easier access.

### Syntax
```python
df.set_index('column_name')  # Returns new DataFrame
df.set_index('column_name', inplace=True)  # Modifies in place
```

### Example
```python
>>> df_indexed = df.set_index('employee_id')
>>> df_indexed.loc[101]  # Access by employee_id
```
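
The example above can be sketched end to end with illustrative data (`staff`). The column moves out of `columns` and into the index, and `reset_index()` reverses the operation:

```python
import pandas as pd

staff = pd.DataFrame({"emp_id": [101, 102], "dept": ["ops", "dev"]})

by_id = staff.set_index("emp_id")  # returns a new DataFrame
print(list(by_id.columns))         # ['dept']
print(by_id.loc[102, "dept"])      # dev

# reset_index moves the index back into an ordinary column.
restored = by_id.reset_index()
print(list(restored.columns))      # ['emp_id', 'dept']
print(list(staff.columns))         # ['emp_id', 'dept']  (original unchanged)
```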

### Task
Create a copy of `df` and set the 'name' column as the index. Store in `df_indexed`.

### Expected Properties
- Index should be ['Alice', 'Bob', 'Charlie']
- 'name' should no longer be a column
- Should have 2 columns remaining

In [None]:
# Your solution:
df_indexed = None

In [None]:
# Verification
check.has_index(df_indexed, ['Alice', 'Bob', 'Charlie'], "P13: Index")
check.not_contains(list(df_indexed.columns), 'name', "P13: 'name' not in columns")
check.is_true(len(df_indexed.columns) == 2, "P13: Two columns remain", "Should have 2 columns after setting index")

---
## Problem 14: Add New Column
**Difficulty:** Hard

### Concept
You can add new columns to a DataFrame by assigning values to a new column name. The values can be a scalar (applied to all rows), a list, or a Series.

### Syntax
```python
df['new_column'] = value  # Scalar assigned to all rows
df['new_column'] = [val1, val2, val3]  # List of values
df['new_column'] = df['col1'] + df['col2']  # Computed from other columns
```

### Example
```python
>>> df['status'] = 'active'
>>> df['total'] = df['price'] * df['quantity']
```
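
Column assignment modifies the DataFrame it is applied to, so work on a copy when the original must stay intact. A minimal sketch with made-up data (`cart`):

```python
import pandas as pd

cart = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})

priced = cart.copy()              # independent copy; cart is not affected
priced["currency"] = "EUR"        # scalar is repeated for every row
priced["total"] = priced["price"] * priced["quantity"]  # element-wise product

print(priced["currency"].tolist())  # ['EUR', 'EUR']
print(priced["total"].tolist())     # [6.0, 7.0]
print(list(cart.columns))           # ['price', 'quantity']
```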

### Task
Create a copy of `df` and add a new column 'country' with all values as 'USA'. Store in `df_with_country`.

### Expected Properties
- Should contain a column named 'country'
- All values in 'country' should be 'USA'
- Should have 4 columns total

In [None]:
# Your solution:
df_with_country = None

In [None]:
# Verification
check.contains_column(df_with_country, 'country', "P14: Has 'country' column")
check.is_true(all(df_with_country['country'] == 'USA'), "P14: All values are 'USA'", "All country values should be 'USA'")
check.is_true(len(df_with_country.columns) == 4, "P14: Four columns", "Should have 4 columns")

---
## Problem 15: Create DataFrame from NumPy Array
**Difficulty:** Hard

### Concept
DataFrames can be created from NumPy arrays. NumPy arrays carry no row or column labels, so pass `columns` (and optionally `index`) to name them; if you omit them, pandas assigns default integer labels 0, 1, 2, etc.

### Syntax
```python
pd.DataFrame(numpy_array, columns=['A', 'B'], index=['x', 'y'])
```

### Example
```python
>>> arr = np.array([[1, 2], [3, 4]])
>>> df = pd.DataFrame(arr, columns=['X', 'Y'], index=['a', 'b'])
```
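
To see the effect of the labels, this sketch builds the same deterministic array twice, once without and once with labels. The names `arr`, `plain`, and `labeled` are illustrative:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Without labels, pandas falls back to integer positions.
plain = pd.DataFrame(arr)
print(list(plain.columns))  # [0, 1, 2]
print(list(plain.index))    # [0, 1]

# With labels, values can be looked up by name.
labeled = pd.DataFrame(arr, columns=["p", "q", "r"], index=["u", "v"])
print(labeled.loc["v", "q"])  # 4
print(labeled.shape)          # (2, 3)
```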

### Task
Create a DataFrame from a 4x3 NumPy array of random integers (0-99). Set columns as ['X', 'Y', 'Z'] and index as ['a', 'b', 'c', 'd']. Store in `np_df`.

Use `np.random.seed(42)` before generating the array (already provided).

### Expected Properties
- Should have shape (4, 3)
- Should have columns ['X', 'Y', 'Z']
- Should have index ['a', 'b', 'c', 'd']

In [None]:
# Your solution:
np.random.seed(42)
np_df = None

In [None]:
# Verification
check.has_shape(np_df, (4, 3), "P15: Shape")
check.has_columns(np_df, ['X', 'Y', 'Z'], "P15: Columns")
check.has_index(np_df, ['a', 'b', 'c', 'd'], "P15: Index")

---
## Summary

Run the cell below to see your overall progress on this notebook.

In [None]:
check.summary()