# **`Data Science Learners Hub`**

**Module:** Python

**Email:** [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates


### **`Hands On Experience:`**


#### Question 1: Creating a DataFrame from Lists and Basic Operations

#### Scenario:
You have three months of sales information for a retail store. Each list holds one field (month name, sales, or expenses), with one entry per month.

```python
# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

# Question:
# Create a DataFrame named 'df_sales' from these lists, and display the DataFrame.
# Calculate the profit for each month (Profit = Sales - Expenses).
# Display the DataFrame after adding the 'Profit' column.
```

In [1]:
import pandas as pd

# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

# Creating a DataFrame from Lists
df_sales = pd.DataFrame({'Month': months, 'Sales': sales, 'Expenses': expenses})

# Calculating Profit
df_sales['Profit'] = df_sales['Sales'] - df_sales['Expenses']

# Displaying the DataFrame
print("DataFrame after Creating and Calculating Profit:")
print(df_sales)

DataFrame after Creating and Calculating Profit:
  Month  Sales  Expenses  Profit
0   Jan   1200       800     400
1   Feb   1500       900     600
2   Mar   1800      1000     800
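
The outline above also lists dictionaries and arrays as DataFrame sources. As a minimal sketch, the same monthly figures can be loaded row by row from a list of dictionaries, or from a 2-D NumPy array with the labels supplied separately. The names `records`, `figures`, `df_from_records`, and `df_from_array` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Row-wise input: one dictionary per month, keys become column names
records = [
    {'Month': 'Jan', 'Sales': 1200, 'Expenses': 800},
    {'Month': 'Feb', 'Sales': 1500, 'Expenses': 900},
    {'Month': 'Mar', 'Sales': 1800, 'Expenses': 1000},
]
df_from_records = pd.DataFrame(records)

# Array input: a 2-D NumPy array carries no labels, so pass them explicitly
figures = np.array([[1200, 800], [1500, 900], [1800, 1000]])
df_from_array = pd.DataFrame(
    figures,
    columns=['Sales', 'Expenses'],
    index=['Jan', 'Feb', 'Mar'],
)
df_from_array['Profit'] = df_from_array['Sales'] - df_from_array['Expenses']

print(df_from_records)
print(df_from_array)
```

Here the array version uses the month names as the index instead of a separate 'Month' column, which is one possible layout rather than a requirement.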


#### Question 2: Reading Data from CSV and Descriptive Statistics

#### Scenario:

Assume you have a CSV file named 'sales_data.csv' with the following contents:

```csv
Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
```

The file records product sales, and some fields are empty. Read the data into a DataFrame and compute descriptive statistics.

```python
# Question:
# Read 'sales_data.csv' into a DataFrame named 'df_sales'.
# Display the first 5 rows of the DataFrame.
# Calculate basic descriptive statistics for the 'Quantity' column.
```



In [2]:
import pandas as pd

# Reading Data from CSV
df_sales = pd.read_csv('sales_data.csv')

# Displaying the first 5 rows
print("First 5 Rows of df_sales:")
print(df_sales.head())  # head() returns the first 5 rows by default; this file has only 4

# Descriptive Statistics for 'Quantity'
quantity_stats = df_sales['Quantity'].describe()
print("\nDescriptive Statistics for 'Quantity':")
print(quantity_stats)

First 5 Rows of df_sales:
      Product  Quantity  Revenue
0      Laptop      10.0  12000.0
1  Smartphone       5.0   8000.0
2      Tablet       NaN   4500.0
3      Camera       3.0      NaN

Descriptive Statistics for 'Quantity':
count     3.000000
mean      6.000000
std       3.605551
min       3.000000
25%       4.000000
50%       5.000000
75%       7.500000
max      10.000000
Name: Quantity, dtype: float64
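
Before computing statistics, it often helps to inspect the shape, column types, and missing-value counts. The sketch below rebuilds the same CSV contents in memory with `io.StringIO` so it runs without the file on disk. The names `csv_text`, `df_check`, and `missing_per_column` are illustrative.

```python
import io
import pandas as pd

# Same contents as 'sales_data.csv', held in a string for a self-contained example
csv_text = """Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
"""
df_check = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
print(df_check.shape)

# Column types: the empty fields are read as NaN, which makes both numeric columns float64
print(df_check.dtypes)

# Count of missing values in each column
missing_per_column = df_check.isna().sum()
print(missing_per_column)
```

The float64 type explains why the quantities print as 10.0 and 5.0 rather than 10 and 5, and why `describe()` reports a count of 3 for a 4-row file: missing values are excluded from the statistics.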


#### Question 3: Handling Missing Values and Filling with Mean

#### Scenario:
The DataFrame 'df_sales' from Question 2 has a missing value in the 'Revenue' column. Fill it with the mean of the non-missing 'Revenue' values.

```python
# Question:
# Handle missing values in the 'Revenue' column by filling them with the mean.
# Display the DataFrame after handling missing values.
```

In [3]:
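# Handling Missing Values in 'Revenue'
# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# recent pandas versions warn about that pattern and it may leave df_sales unchanged.
df_sales['Revenue'] = df_sales['Revenue'].fillna(df_sales['Revenue'].mean())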

# Displaying the DataFrame after Handling Missing Values
print("DataFrame after Handling Missing Values in 'Revenue':")
print(df_sales)

DataFrame after Handling Missing Values in 'Revenue':
      Product  Quantity       Revenue
0      Laptop      10.0  12000.000000
1  Smartphone       5.0   8000.000000
2      Tablet       NaN   4500.000000
3      Camera       3.0   8166.666667
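
Note that 'Quantity' still contains a NaN after this step. The outline lists dropping as the alternative to filling, so here is a minimal sketch of both options on an in-memory copy of the same data. The names `df_demo`, `df_dropped`, and `df_filled` are illustrative, and the median is used here only as one possible fill value, not as the method this exercise prescribes.

```python
import numpy as np
import pandas as pd

# In-memory copy of the sales data, with the same two gaps
df_demo = pd.DataFrame({
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Camera'],
    'Quantity': [10, 5, np.nan, 3],
    'Revenue': [12000, 8000, 4500, np.nan],
})

# Option 1: drop every row whose Quantity is missing
df_dropped = df_demo.dropna(subset=['Quantity'])

# Option 2: fill each column with its own median, passed as a column-to-value dict
df_filled = df_demo.fillna({
    'Quantity': df_demo['Quantity'].median(),
    'Revenue': df_demo['Revenue'].median(),
})

print(df_dropped)
print(df_filled)
```

Both calls return new DataFrames and leave `df_demo` untouched. Dropping loses the Tablet row entirely, while filling keeps all four rows at the cost of introducing estimated values.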


#### Question 4: Removing Duplicates

#### Scenario:
Your DataFrame 'df_orders' contains duplicate entries for customer orders. Remove the duplicates based on all columns.

```python
# Question:
# Identify and remove duplicate rows from 'df_orders'.
# Display the DataFrame after removing duplicates.
```

In [12]:
import pandas as pd

# Sample data for the DataFrame
data = {
    'OrderID': [1, 2, 3, 4, 2, 5, 6, 1],
    'Product': ['A', 'B', 'C', 'D', 'B', 'E', 'F', 'A'],
    'Price': [10.0, 20.0, 15.0, 25.0, 20.0, 30.0, 18.0, 10.0],
    'Shipping_Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-02', '2022-01-05', '2022-01-06', '2022-01-01'],
    'Quantity': [2, 1, 3, 2, 1, 4, 2, 3]  # Row 7 matches row 0 except for Quantity, so it is not a full duplicate
}

# Create a DataFrame
df_orders = pd.DataFrame(data)

# Identifying Duplicate Rows (boolean Series: True for each row that repeats an earlier row)
duplicates = df_orders.duplicated()

# Removing Duplicate Rows
df_orders_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after Removing Duplicates
print("Original DataFrame:")
print(df_orders)

print("\nDataFrame after Removing Duplicates:")
print(df_orders_no_duplicates)


Original DataFrame:
   OrderID Product  Price Shipping_Date  Quantity
0        1       A   10.0    2022-01-01         2
1        2       B   20.0    2022-01-02         1
2        3       C   15.0    2022-01-03         3
3        4       D   25.0    2022-01-04         2
4        2       B   20.0    2022-01-02         1
5        5       E   30.0    2022-01-05         4
6        6       F   18.0    2022-01-06         2
7        1       A   10.0    2022-01-01         3

DataFrame after Removing Duplicates:
   OrderID Product  Price Shipping_Date  Quantity
0        1       A   10.0    2022-01-01         2
1        2       B   20.0    2022-01-02         1
2        3       C   15.0    2022-01-03         3
3        4       D   25.0    2022-01-04         2
5        5       E   30.0    2022-01-05         4
6        6       F   18.0    2022-01-06         2
7        1       A   10.0    2022-01-01         3
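
Row 7 survives because it differs from row 0 in 'Quantity', so it is not a duplicate across all columns. The sketch below shows two related options on a trimmed copy of the data ('Shipping_Date' is left out for brevity): listing every member of a duplicate group with `keep=False`, and deduplicating on 'OrderID' alone with `subset`. The names `df_dupes`, `all_copies`, and `df_unique_orders` are illustrative.

```python
import pandas as pd

# Trimmed copy of the orders data (Shipping_Date omitted for brevity)
df_dupes = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 2, 5, 6, 1],
    'Product': ['A', 'B', 'C', 'D', 'B', 'E', 'F', 'A'],
    'Price': [10.0, 20.0, 15.0, 25.0, 20.0, 30.0, 18.0, 10.0],
    'Quantity': [2, 1, 3, 2, 1, 4, 2, 3],
})

# keep=False flags every row in a duplicate group, not only the later repeats
all_copies = df_dupes[df_dupes.duplicated(keep=False)]
print(all_copies)

# subset=['OrderID'] compares rows on OrderID only and keeps the first occurrence
df_unique_orders = df_dupes.drop_duplicates(subset=['OrderID'], keep='first')
print(df_unique_orders)
```

Deduplicating on 'OrderID' removes row 7 as well as row 4. Whether that is desirable depends on whether a repeated order ID with a different quantity is an error or a legitimate second record.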


#### Question 5: Conditional Indexing and Filtering

#### Scenario:
You want to analyze only the orders with a quantity greater than 2.

```python
# Question:
# Create a new DataFrame 'df_large_orders' containing only the orders with Quantity greater than 2.
# Display the new DataFrame.
```

In [13]:
# Conditional Indexing and Filtering
df_large_orders = df_orders[df_orders['Quantity'] > 2]

# Displaying the DataFrame with Large Orders
print("DataFrame with Orders Quantity > 2:")
print(df_large_orders)

DataFrame with Orders Quantity > 2:
   OrderID Product  Price Shipping_Date  Quantity
2        3       C   15.0    2022-01-03         3
5        5       E   30.0    2022-01-05         4
7        1       A   10.0    2022-01-01         3
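
Filters can also combine several conditions and select specific columns at the same time. As a minimal sketch on a trimmed copy of the orders data, the example below keeps orders with Quantity greater than 2 and Price of at least 15.0, first with a boolean mask and `.loc`, then with `query()`. The price threshold and the names `df_filter_demo`, `mask`, `df_large_pricey`, and `df_same` are illustrative.

```python
import pandas as pd

# Trimmed copy of the orders data with only the columns used here
df_filter_demo = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 2, 5, 6, 1],
    'Price': [10.0, 20.0, 15.0, 25.0, 20.0, 30.0, 18.0, 10.0],
    'Quantity': [2, 1, 3, 2, 1, 4, 2, 3],
})

# & is element-wise AND; each condition needs its own parentheses
mask = (df_filter_demo['Quantity'] > 2) & (df_filter_demo['Price'] >= 15.0)

# .loc takes the row mask and a list of columns to keep
df_large_pricey = df_filter_demo.loc[mask, ['OrderID', 'Quantity']]
print(df_large_pricey)

# query() expresses the same row filter as a string
df_same = df_filter_demo.query('Quantity > 2 and Price >= 15.0')
print(df_same)
```

Using Python's `and` in place of `&` on the two Series raises a ValueError, because a Series has no single truth value. Inside a `query()` string, `and` is accepted.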
