### **Exercise 1: Series Creation and Operations**
1. Create a Pandas **Series** from the following list: `[10, 20, 30, 40, 50]`.
2. Assign custom index labels: `['a', 'b', 'c', 'd', 'e']` to this Series.
3. Access the value associated with index `'c'`.
4. Filter the Series to keep only the values greater than 25.

---

Solutions

In [8]:
import pandas as pd

# Create a Pandas Series from a list:
series = pd.Series([10, 20, 30, 40, 50])
print("Series from list:\n", series)

Series from list:
 0    10
1    20
2    30
3    40
4    50
dtype: int64


In [10]:
# Assign custom index labels: ['a', 'b', 'c', 'd', 'e'] to this Series.
series.index = ['a', 'b', 'c', 'd', 'e']
print("Series with custom index:\n", series)

Series with custom index:
 a    10
b    20
c    30
d    40
e    50
dtype: int64


In [11]:
# Access the value associated with index 'c'.
value_c = series['c']
print("Value at index 'c':", value_c)

Value at index 'c': 30


In [14]:
# Keep only the values greater than 25.
filtered_s = series[series > 25]
print("Values greater than 25:\n", filtered_s)

Values greater than 25:
 c    30
d    40
e    50
dtype: int64
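
The filter in the last cell works through a boolean mask. As a minimal sketch (the variable name `mask` is illustrative, and passing `index=` to the constructor is an alternative to assigning `series.index` afterwards), the two steps can be separated like this:

```python
import pandas as pd

# Build the Series with its labels in one step
# (equivalent to assigning series.index after creation).
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# The comparison returns a boolean Series aligned on the same index.
mask = series > 25

# Indexing with the mask keeps only the entries where it is True.
filtered_s = series[mask]

print(mask)
print(filtered_s)
```

Printing `mask` before applying it is a quick way to check which labels a condition selects.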


### **Exercise 2: Creating a DataFrame**
1. Create a DataFrame with the following data:
    - `Name`: ['John', 'Sara', 'Tom', 'Lucy']
    - `Age`: [23, 29, 22, 30]
    - `City`: ['New York', 'Los Angeles', 'Chicago', 'Houston']
2. Display the DataFrame.
3. Print the `Name` and `Age` columns only.
4. Add a new column `Salary` with values `[50000, 60000, 55000, 70000]`.
5. Remove the `City` column from the DataFrame.

---

Solutions

In [7]:
import pandas as pd
# 1. Create a DataFrame
data = {'Name': ['John', 'Sara', 'Tom', 'Lucy'],
        'Age': [23, 29, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

In [8]:
# 2. Display the DataFrame
print("Original DataFrame:\n", df)

Original DataFrame:
    Name  Age         City
0  John   23     New York
1  Sara   29  Los Angeles
2   Tom   22      Chicago
3  Lucy   30      Houston


In [9]:
# 3. Print the 'Name' and 'Age' columns only
print("\nName and Age columns:\n", df[['Name', 'Age']])


Name and Age columns:
    Name  Age
0  John   23
1  Sara   29
2   Tom   22
3  Lucy   30


In [10]:
# 4. Add a new column 'Salary'
df['Salary'] = [50000, 60000, 55000, 70000]
print("\nDataFrame after adding 'Salary' column:\n", df)


DataFrame after adding 'Salary' column:
    Name  Age         City  Salary
0  John   23     New York   50000
1  Sara   29  Los Angeles   60000
2   Tom   22      Chicago   55000
3  Lucy   30      Houston   70000


In [11]:
# 5. Remove the 'City' column
df = df.drop(columns=['City'])
print("\nDataFrame after dropping 'City' column:\n", df)


DataFrame after dropping 'City' column:
    Name  Age  Salary
0  John   23   50000
1  Sara   29   60000
2   Tom   22   55000
3  Lucy   30   70000
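
Two details from the cells above are easy to miss: the double brackets in `df[['Name', 'Age']]` return a DataFrame, while single brackets return a Series, and `drop` returns a new DataFrame rather than modifying the original. The sketch below illustrates both on a shortened version of the exercise data (the names `name_series`, `name_frame`, and `dropped` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Sara'],
                   'Age': [23, 29],
                   'City': ['New York', 'Los Angeles']})

# Single brackets with one label return a Series.
name_series = df['Name']

# A list of labels returns a DataFrame, even with one column.
name_frame = df[['Name']]

# drop() returns a new DataFrame; df keeps 'City' until it is reassigned.
dropped = df.drop(columns=['City'])

print(type(name_series).__name__, type(name_frame).__name__)
print(list(df.columns), list(dropped.columns))
```

This is why the solution writes `df = df.drop(columns=['City'])`: without the reassignment, `df` would still contain the `City` column.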


### **Exercise 3: Loading Data and Inspecting**
1. Load the provided dataset (`dirty_data.csv`) using Pandas.
2. Display the first 5 rows of the dataset.
3. Print the structure of the DataFrame (column names, non-null counts, data types).
4. Get the summary statistics of the numeric columns in the dataset.
5. Check the shape of the dataset.

---

Solutions

In [12]:
# 1. Load the dataset into a DataFrame
url = 'https://raw.githubusercontent.com/siddhantbhattarai/Machine_Learning_Bootcamp_2024/refs/heads/main/Pandas/Data_Cleaning/dirty_data.csv'
df = pd.read_csv(url)

In [13]:
# 2. Display the first 5 rows
print("First 5 rows of the dataset:\n", df.head())

First 5 rows of the dataset:
                Name   Age           City   Join_Date     Salary  Gender
0  Savannah Patrick  21.0  san francisco  2024-02-04  9999999.0    Male
1    Jessica Ramsey  32.0             SF  15/03/2022  9999999.0  Female
2       Jacob White  70.0             LA  2019-11-03        NaN  female
3        Erik Ortiz  62.0  San Francisco  2020-06-14       10.0    male
4      Tonya Dudley  42.0             LA  2023-07-21  2000000.0  Female


In [14]:
# 3. Print the structure of the DataFrame
print("\nStructure of the DataFrame:\n")
df.info()


Structure of the DataFrame:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10100 entries, 0 to 10099
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       9077 non-null   object 
 1   Age        9067 non-null   float64
 2   City       10100 non-null  object 
 3   Join_Date  9102 non-null   object 
 4   Salary     7814 non-null   float64
 5   Gender     8853 non-null   object 
dtypes: float64(2), object(4)
memory usage: 473.6+ KB


In [15]:
# 4. Get summary statistics of the numeric columns
print("\nSummary statistics of numeric columns:\n", df.describe())


Summary statistics of numeric columns:
                Age        Salary
count  9067.000000  7.814000e+03
mean     43.918275  3.471742e+06
std      15.406788  4.201636e+06
min      18.000000  1.000000e+01
25%      30.000000  1.000000e+01
50%      44.000000  2.000000e+06
75%      57.000000  9.999999e+06
max      70.000000  9.999999e+06


In [16]:
# 5. Check the shape of the dataset
print("\nShape of the DataFrame:", df.shape)


Shape of the DataFrame: (10100, 6)
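
The non-null counts reported by `df.info()` can also be read as missing-value counts per column with `isna().sum()`. The sketch below uses a small made-up frame (`sample` and its values are hypothetical, not rows from `dirty_data.csv`) so that it runs without downloading the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
sample = pd.DataFrame({
    'Name': ['Ann', None, 'Cy'],
    'Age': [21.0, 32.0, np.nan],
    'Salary': [9999999.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = sample.isna().sum()

print(missing_per_column)
```

Applied to the real dataset, the same call would be `df.isna().sum()`.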


### **Exercise 4: Indexing and Selecting Data**
1. Select and print the rows with index labels 2 and 4 from the following DataFrame:
   ```python
   data = {'Product': ['A', 'B', 'C', 'D', 'E'], 'Price': [100, 150, 200, 250, 300]}
   df = pd.DataFrame(data)
   ```
2. Use `.iloc[]` to select the first three rows and first two columns.
3. Use `.loc[]` to select the rows where `Price` is greater than 200.

---

Solutions

In [17]:
# Given DataFrame
data = {'Product': ['A', 'B', 'C', 'D', 'E'], 'Price': [100, 150, 200, 250, 300]}
df = pd.DataFrame(data)

In [18]:
# 1. Select and print the rows with index labels 2 and 4
rows_selected = df.loc[[2, 4]]
print("Rows with index labels 2 and 4:\n", rows_selected)

Rows with index labels 2 and 4:
   Product  Price
2       C    200
4       E    300


In [19]:
# 2. Use .iloc[] to select the first three rows and first two columns
subset_iloc = df.iloc[0:3, 0:2]
print("\nFirst three rows and first two columns using .iloc[]:\n", subset_iloc)


First three rows and first two columns using .iloc[]:
   Product  Price
0       A    100
1       B    150
2       C    200


In [20]:
# 3. Use .loc[] to select the rows where Price is greater than 200
subset_loc = df.loc[df['Price'] > 200]
print("\nRows where Price > 200 using .loc[]:\n", subset_loc)


Rows where Price > 200 using .loc[]:
   Product  Price
3       D    250
4       E    300
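
With the default `RangeIndex`, labels and positions coincide, so `.loc[[2, 4]]` and `.iloc[[2, 4]]` select the same rows. The difference shows up once the index is something else. The sketch below assumes a hypothetical index of `[10, 20, 30, 40, 50]` on the same data:

```python
import pandas as pd

data = {'Product': ['A', 'B', 'C', 'D', 'E'], 'Price': [100, 150, 200, 250, 300]}

# Hypothetical non-default index labels, chosen only for illustration.
df = pd.DataFrame(data, index=[10, 20, 30, 40, 50])

# .loc selects by label: the rows labelled 30 and 50.
by_label = df.loc[[30, 50]]

# .iloc selects by position: the third and fifth rows.
by_position = df.iloc[[2, 4]]

print(by_label)
print(by_position)
```

Both selections return products C and E here, but `df.loc[[2, 4]]` would raise a `KeyError` on this frame because no row carries the label 2 or 4.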


### **Exercise 5: Slicing Data**
1. Create the following DataFrame:
   ```python
   data = {'Name': ['Anna', 'Ben', 'Charlie', 'David', 'Eva'],
           'Age': [28, 24, 35, 30, 29],
           'City': ['Seattle', 'Austin', 'New York', 'Boston', 'Denver']}
   df = pd.DataFrame(data)
   ```
2. Use slicing to select the first three rows.
3. Use slicing to select the last two columns of all rows.
4. Use slicing to select the rows with index labels 2 to 4 (inclusive) and the columns `Name` and `City`.

---

Solutions

In [21]:
# Given DataFrame
data = {'Name': ['Anna', 'Ben', 'Charlie', 'David', 'Eva'],
        'Age': [28, 24, 35, 30, 29],
        'City': ['Seattle', 'Austin', 'New York', 'Boston', 'Denver']}
df = pd.DataFrame(data)

In [22]:
# 2. Use slicing to select the first three rows
first_three_rows = df.iloc[0:3]
print("First three rows:\n", first_three_rows)

First three rows:
       Name  Age      City
0     Anna   28   Seattle
1      Ben   24    Austin
2  Charlie   35  New York


In [23]:
# 3. Use slicing to select the last two columns of all rows
last_two_columns = df.iloc[:, -2:]
print("\nLast two columns of all rows:\n", last_two_columns)


Last two columns of all rows:
    Age      City
0   28   Seattle
1   24    Austin
2   35  New York
3   30    Boston
4   29    Denver


In [24]:
# 4. Use slicing to select rows 2 to 4 and columns 'Name' and 'City' (.loc slices include both endpoints)
subset_slicing = df.loc[2:4, ['Name', 'City']]
print("\nRows 2 to 4 and columns 'Name' and 'City':\n", subset_slicing)


Rows 2 to 4 and columns 'Name' and 'City':
       Name      City
2  Charlie  New York
3    David    Boston
4      Eva    Denver
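
The last solution returns three rows because `.loc` slices by label and includes both endpoints, whereas `.iloc` slices by position and excludes the stop value. A minimal sketch of the contrast on the same DataFrame (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

data = {'Name': ['Anna', 'Ben', 'Charlie', 'David', 'Eva'],
        'Age': [28, 24, 35, 30, 29],
        'City': ['Seattle', 'Austin', 'New York', 'Boston', 'Denver']}
df = pd.DataFrame(data)

# Label-based slice: labels 2, 3 and 4 are all included.
by_label = df.loc[2:4, ['Name', 'City']]

# Position-based slice: positions 2 and 3 only; the stop position 4 is excluded.
by_position = df.iloc[2:4, [0, 2]]

print(len(by_label), len(by_position))
```

To get the same three rows with `.iloc`, the slice would need to be `2:5`.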
