# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**
4. **Creating DataFrames**
   - From lists, dictionaries, and arrays
   - Reading data from CSV, Excel, and other formats

5. **Basic DataFrame Operations**
   - Inspecting the DataFrame
   - Indexing and selecting data
   - Descriptive statistics

6. **Data Cleaning and Handling Missing Data**
   - Handling missing values
   - Dropping or filling missing values
   - Removing duplicates

### **`4. Creating DataFrames`**


#### `From Lists, Dictionaries, and Arrays`

**Introduction:**
Creating a Pandas DataFrame is a fundamental step in data analysis. In this section, we will explore three common methods for creating DataFrames: using lists, dictionaries, and arrays.

**From Lists:**

1. **Using Lists as Columns:**
   - You can create a DataFrame by using lists as columns. Each list represents a column, and the lengths of the lists must match.
     ```python
     import pandas as pd

     names = ['Alice', 'Bob', 'Charlie']
     ages = [25, 30, 35]
     cities = ['New York', 'San Francisco', 'Los Angeles']

     df_from_lists = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities})
     ```

2. **Specifying Index:**
   - You can specify a custom index for the DataFrame:
     ```python
     custom_index = ['person1', 'person2', 'person3']
     df_from_lists_custom_index = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities}, index=custom_index)
     ```
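
The snippets above wrap each list in a dictionary, one list per column. A list of rows can also be passed directly, with the column names supplied through `columns`. The following is a minimal sketch that reuses the sample values above; the names `rows` and `df_from_rows` are illustrative and not part of the original material.

```python
import pandas as pd

# Each inner list is one row; the column names are supplied separately.
rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'San Francisco'],
    ['Charlie', 35, 'Los Angeles'],
]

df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])
print(df_from_rows)
```

This row-oriented form is often convenient when the data arrives record by record, for example from a loop or an API response.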

**From Dictionaries:**

1. **Using Dictionary Keys as Columns:**
   - Creating a DataFrame from a dictionary allows you to use the keys as column names.
     ```python
     data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
                  'Age': [25, 30, 35],
                  'City': ['New York', 'San Francisco', 'Los Angeles']}
     
     df_from_dict = pd.DataFrame(data_dict)
     ```

2. **Specifying Index:**
   - Similar to the list method, you can specify a custom index:
     ```python
     df_from_dict_custom_index = pd.DataFrame(data_dict, index=custom_index)
     ```

**From Arrays:**

1. **Using NumPy Arrays:**
   - NumPy arrays can be used to create DataFrames. Ensure that all arrays have the same length.
     ```python
     import numpy as np

     names_array = np.array(['Alice', 'Bob', 'Charlie'])
     ages_array = np.array([25, 30, 35])
     cities_array = np.array(['New York', 'San Francisco', 'Los Angeles'])

     df_from_arrays = pd.DataFrame({'Name': names_array, 'Age': ages_array, 'City': cities_array})
     ```

2. **Specifying Index:**
   - As before, you can specify a custom index:
     ```python
     df_from_arrays_custom_index = pd.DataFrame({'Name': names_array, 'Age': ages_array, 'City': cities_array}, index=custom_index)
     ```

**Importance of Specifying Column Names and Indices:**

1. **Clarity and Readability:**
   - Specifying meaningful column names enhances the clarity and readability of your code and data.

2. **Consistency in Analysis:**
   - A consistent index allows for smoother and more predictable data analysis, especially when combining DataFrames or performing complex operations.

3. **Avoiding Ambiguity:**
   - Explicitly defining column names and indices avoids ambiguity and ensures that each piece of data is correctly associated with its intended category.
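
To illustrate why a consistent index matters when combining data, the sketch below adds two Series whose labels appear in a different order. Pandas aligns the values by label rather than by position. The region names and figures are made up for the illustration.

```python
import pandas as pd

# Two Series that share labels but list them in a different order.
q1_sales = pd.Series([100, 200, 300], index=['east', 'west', 'north'])
q2_sales = pd.Series([30, 10, 20], index=['north', 'east', 'west'])

# Addition matches values on the index labels, not on their position.
total = q1_sales + q2_sales
print(total)
```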

**Conclusion:**
Creating DataFrames in Pandas using lists, dictionaries, and arrays provides flexibility and versatility in handling different types of data. Specifying column names and indices during DataFrame creation is essential for clarity and consistency in subsequent data analysis tasks.


#### Example :

In [1]:
import pandas as pd
import numpy as np

# Creating a DataFrame from Lists
list_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df_from_lists = pd.DataFrame(list_data)

# Creating a DataFrame from a Dictionary
dict_data = {
    'ID': [1, 2, 3, 4, 5],
    'Subject': ['Math', 'Physics', 'Chemistry', 'Biology', 'English'],
    'Score': [85, 92, 78, 89, 95],
}

df_from_dict = pd.DataFrame(dict_data)

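# Creating a DataFrame from Arrays (using NumPy)
# Note: a NumPy array holds a single dtype, so mixing numbers and strings
# stores every value as a string (ID and Quantity are not numeric below)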
array_data = np.array([
    [1, 'Apple', 3],
    [2, 'Banana', 6],
    [3, 'Orange', 4],
])

df_from_arrays = pd.DataFrame(array_data, columns=['ID', 'Fruit', 'Quantity'])

# Displaying the DataFrames
print("DataFrame from Lists:")
print(df_from_lists)

print("\nDataFrame from Dictionary:")
print(df_from_dict)

print("\nDataFrame from Arrays:")
print(df_from_arrays)


DataFrame from Lists:
      Name  Age  Salary  Experience
0    Alice   25   50000           3
1      Bob   30   60000           5
2  Charlie   35   75000           8
3    David   22   48000           2
4     Emma   28   55000           4

DataFrame from Dictionary:
   ID    Subject  Score
0   1       Math     85
1   2    Physics     92
2   3  Chemistry     78
3   4    Biology     89
4   5    English     95

DataFrame from Arrays:
  ID   Fruit Quantity
0  1   Apple        3
1  2  Banana        6
2  3  Orange        4


#### Real World Scenario:
Imagine you have survey data from a group of people regarding their preferences for various types of electronic devices. Each person's data includes their ID, name, age, and the ratings (out of 10) they gave to different devices like smartphones, laptops, and smartwatches.

In [2]:
import pandas as pd

# Sample survey data
survey_data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 22, 28],
    'Smartphone_Rating': [9, 8, 7, 9, 8],
    'Laptop_Rating': [8, 7, 9, 6, 8],
    'Smartwatch_Rating': [7, 6, 8, 7, 9],
}

# Creating a DataFrame from the survey data
df_survey = pd.DataFrame(survey_data)

# Displaying the DataFrame
print("Survey Data DataFrame:")
print(df_survey)

Survey Data DataFrame:
   ID     Name  Age  Smartphone_Rating  Laptop_Rating  Smartwatch_Rating
0   1    Alice   25                  9              8                  7
1   2      Bob   30                  8              7                  6
2   3  Charlie   35                  7              9                  8
3   4    David   22                  9              6                  7
4   5     Emma   28                  8              8                  9


#### Considerations or Peculiarities:

- **Column Consistency:** Every list or array passed as a column must have the same length; otherwise Pandas raises a `ValueError`. When building a DataFrame from a list of row dictionaries instead, any key missing from a row becomes `NaN` in that row.

- **Data Types:** Be mindful of the data types within lists or arrays. Pandas will attempt to infer data types, but it's helpful to explicitly specify them if needed.

- **Indexing:** Decide whether you need to set a specific column as the index. The examples above keep the default integer index (0, 1, 2, ...); a column such as 'ID' can be promoted to the index with `set_index('ID')`.
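
The data-type consideration is easy to run into with NumPy input: an array that mixes numbers and strings stores everything as strings, so the resulting columns are not numeric. The sketch below shows the effect and one possible fix using `astype`; the names `df_mixed` and `df_fixed` are illustrative.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so the integers here are stored as strings.
array_data = np.array([
    [1, 'Apple', 3],
    [2, 'Banana', 6],
])
df_mixed = pd.DataFrame(array_data, columns=['ID', 'Fruit', 'Quantity'])
print(df_mixed.dtypes)  # no column is numeric at this point

# One way to recover numeric columns: convert them explicitly.
df_fixed = df_mixed.astype({'ID': 'int64', 'Quantity': 'int64'})
print(df_fixed.dtypes)
print(df_fixed['Quantity'].sum())
```

Building the DataFrame from a dictionary of separate lists or arrays avoids the problem, because each column then keeps its own type.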

#### Common Mistakes:

- **Mismatched Lengths:** Forgetting to check and ensure that all lists or arrays used to create a DataFrame have the same length can lead to errors.

- **Misspelled Column Names:** When creating a DataFrame from a dictionary, the keys become the column names exactly as written. Pandas does not validate them, so a misspelled key silently produces a column with the misspelled name, and later lookups by the intended name fail with a `KeyError`.

- **Incorrect Data Types:** If your data types are not appropriate, it can lead to unexpected results. Check that numeric columns are treated as numbers, and categorical columns are specified as such.
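
The mismatched-length mistake can be reproduced in a few lines. This is a small sketch with made-up values; it catches the `ValueError` that Pandas raises so the message can be inspected.

```python
import pandas as pd

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30]  # one value short on purpose

try:
    pd.DataFrame({'Name': names, 'Age': ages})
    error_message = None
except ValueError as exc:
    # Pandas refuses to build the frame when column lengths differ.
    error_message = str(exc)

print(error_message)
```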


#### `Reading Data into a DataFrame from Various Formats`

**Introduction:**
Pandas provides versatile functions to read data from different file formats, making it a powerful tool for handling diverse data sources. In this section, we will explore how to read data into a DataFrame from common formats such as CSV and Excel, and discuss additional formats supported by Pandas.

**Reading from CSV:**

1. **Using `read_csv` Function:**
   - Reading data from a CSV file is straightforward using the `read_csv` function:
     ```python
     import pandas as pd

     df_csv = pd.read_csv('data.csv')
     ```

2. **Customizing Parameters:**
   - You can customize parameters such as delimiter, encoding, and header during reading:
     ```python
     df_custom_csv = pd.read_csv('data.csv', delimiter=';', encoding='utf-8', header=0)
     ```
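
To try these parameters without a file on disk, `read_csv` also accepts a file-like object. The sketch below feeds it an in-memory, semicolon-delimited string; the sample text is made up for the illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-delimited file.
csv_text = "Name;Age;City\nAlice;25;New York\nBob;30;San Francisco\n"

# sep=';' sets the delimiter; header=0 uses the first line as column names.
df_custom_csv = pd.read_csv(io.StringIO(csv_text), sep=';', header=0)
print(df_custom_csv)
```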

**Reading from Excel:**

1. **Using `read_excel` Function:**
   - Reading data from an Excel file is accomplished with the `read_excel` function (reading `.xlsx` files requires an Excel engine such as `openpyxl` to be installed):
     ```python
     df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
     ```

2. **Specifying Columns:**
   - You can specify columns to read from Excel:
     ```python
     df_excel_columns = pd.read_excel('data.xlsx', sheet_name='Sheet1', usecols=['Name', 'Age'])
     ```

**Reading from Other Formats:**

1. **JSON:**
   - Pandas supports reading data from JSON files:
     ```python
     df_json = pd.read_json('data.json')
     ```

2. **HTML (Web Scraping):**
   - `read_html` parses the `<table>` elements of an HTML page and returns a list of DataFrames, one per table. It needs an HTML parser such as `lxml` to be installed:
     ```python
     url = 'https://example.com/table'
     df_html = pd.read_html(url)[0]  # [0] selects the first table from the page
     ```

3. **SQL Databases:**
   - Reading data from SQL databases using `read_sql`:
     ```python
     from sqlalchemy import create_engine

     engine = create_engine('sqlite:///example.db')
     query = 'SELECT * FROM my_table'
     df_sql = pd.read_sql(query, engine)
     ```
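
The SQL case can be tried without SQLAlchemy or a database file, since `read_sql` also accepts a standard-library `sqlite3` connection. The sketch below uses an in-memory database; the table name and rows are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a small, made-up table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (id INTEGER, name TEXT)')
conn.executemany('INSERT INTO my_table VALUES (?, ?)', [(1, 'Alice'), (2, 'Bob')])
conn.commit()

# Run the query and load the result set into a DataFrame.
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()
print(df_sql)
```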

**Flexibility in Handling Diverse Data Sources:**

1. **URLs and HTTP(S):**
   - Reading data directly from URLs:
     ```python
     url = 'https://example.com/data.csv'
     df_url = pd.read_csv(url)
     ```

2. **ZIP Archives:**
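   - Reading a CSV file stored inside a ZIP archive. The archive must contain exactly one data file; the compression is also inferred from the `.zip` extension if `compression` is omitted: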
     ```python
     df_zip = pd.read_csv('archive.zip', compression='zip', header=0)
     ```

3. **Reading from Clipboard:**
   - Copying data to the clipboard and reading it directly into a DataFrame:
     ```python
     df_clipboard = pd.read_clipboard()
     ```

**Conclusion:**
Pandas' flexibility in reading data from various formats, including CSV, Excel, JSON, HTML, SQL databases, and more, makes it a versatile tool for handling diverse data sources. The ability to read directly from URLs, ZIP archives, and the clipboard enhances its capabilities for real-world data scenarios.


### **`5. Basic DataFrame Operations`**


#### **`Inspecting the DataFrame`**

**Introduction:**
Inspecting a DataFrame is an essential step in understanding its structure and contents. Pandas provides several methods that allow you to gain insights into the data quickly. In this section, we'll explore common methods such as `head()`, `tail()`, `info()`, `shape`, and `describe()`.

**Using `head()` and `tail()`:**

1. **`head(n)`:**
   - The `head()` method returns the first `n` rows of the DataFrame (5 if `n` is omitted). It is useful for quickly getting an overview of the dataset.
     ```python
     import pandas as pd

     df = pd.read_csv('data.csv')
     df_head = df.head(5)  # Display the first 5 rows
     ```

2. **`tail(n)`:**
   - The `tail()` method returns the last `n` rows of the DataFrame (5 if `n` is omitted), allowing you to inspect the end of the dataset.
     ```python
     df_tail = df.tail(5)  # Display the last 5 rows
     ```

**Using `info()`:**

1. **`info()`:**
   - The `info()` method provides a concise summary of the DataFrame, including the data types, non-null counts, and memory usage.
     ```python
     df.info()  # prints the summary directly and returns None
     ```
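
Because `info()` prints its summary and returns `None`, assigning its result to a variable does not capture the text. If the summary is needed as a string (for logging, say), one option is the `buf` parameter. A minimal sketch with made-up data:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# info() writes to standard output; its return value is None.
result = df.info()

# To keep the summary as text, direct it into an in-memory buffer.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
```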

**Using `shape`:**

1. **`shape`:**
   - The `shape` attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
     ```python
     df_shape = df.shape
     ```

**Using `describe()`:**

1. **`describe()`:**
   - The `describe()` method generates descriptive statistics for numeric columns by default: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.
     ```python
     df_describe = df.describe()
     ```

2. **Customizing `describe()`:**
   - You can customize the output of `describe()` to include specific percentiles or types of statistics.
     ```python
     custom_describe = df.describe(percentiles=[0.1, 0.5, 0.9], include='all')  # 10th/50th/90th percentiles, all column types
     ```

**Conclusion:**
Inspecting a DataFrame is a crucial step in the data analysis process. Methods such as `head()`, `tail()`, `info()`, `shape`, and `describe()` provide valuable information about the structure, contents, and statistical summary of the dataset. Using these methods allows you to quickly assess the data and make informed decisions about further analysis.


#### Example:


In [3]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

# Using head() and tail() for an overview
print("\nFirst 3 Rows (head()):")
print(df.head(3))

print("\nLast 2 Rows (tail()):")
print(df.tail(2))

# Using info() for a summary (info() prints directly and returns None)
print("\nDataFrame Info:")
df.info()

# Using shape to get dimensions
df_shape = df.shape
print("\nDataFrame Shape:", df_shape)

# Using describe() for summary statistics
df_describe = df.describe()
print("\nSummary Statistics:\n", df_describe)


Original DataFrame:
      Name  Age  Salary  Experience
0    Alice   25   50000           3
1      Bob   30   60000           5
2  Charlie   35   75000           8
3    David   22   48000           2
4     Emma   28   55000           4

First 3 Rows (head()):
      Name  Age  Salary  Experience
0    Alice   25   50000           3
1      Bob   30   60000           5
2  Charlie   35   75000           8

Last 2 Rows (tail()):
    Name  Age  Salary  Experience
3  David   22   48000           2
4   Emma   28   55000           4

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Experience  5 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes

DataFrame Shape: (5, 4)

Summary Statistics:
              Age        Salary  Experience
count   5.000000      5.000000    5.000000
mean   28.000000  57600.000000    4.400000
std     4.949747  10784.247772    2.302173
min    22.000000  48000.000000    2.000000
25%    25.000000  50000.000000    3.000000
50%    28.000000  55000.000000    4.000000
75%    30.000000  60000.000000    5.000000
max    35.000000  75000.000000    8.000000

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about sales transactions for an e-commerce platform. You want to inspect the data to understand its structure, check for missing values, and get a quick overview of the sales performance.

In [4]:
import pandas as pd

# Sample e-commerce sales data
sales_data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera'],
    'Quantity': [2, 1, 3, 2, 1],
    'Price': [1200, 800, 300, 150, 700],
    'CustomerID': [101, 102, 103, 104, 105],
    'Date': ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Inspecting the DataFrame
print("Overview of Sales Data:")
print(df_sales.head())
print("\nStructure of Sales Data:")
df_sales.info()  # info() prints its summary directly and returns None
print("\nSummary Statistics of Sales Data:")
print(df_sales.describe())

Overview of Sales Data:
   OrderID     Product  Quantity  Price  CustomerID        Date
0      101      Laptop         2   1200         101  2022-01-01
1      102  Smartphone         1    800         102  2022-01-02
2      103      Tablet         3    300         103  2022-01-02
3      104  Headphones         2    150         104  2022-01-03
4      105      Camera         1    700         105  2022-01-03

Structure of Sales Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   OrderID     5 non-null      int64 
 1   Product     5 non-null      object
 2   Quantity    5 non-null      int64 
 3   Price       5 non-null      int64 
 4   CustomerID  5 non-null      int64 
 5   Date        5 non-null      object
dtypes: int64(4), object(2)
memory usage: 368.0+ bytes

Summary Statistics of Sales Data:
          OrderID  Quantity        Price  CustomerI

#### Considerations or Peculiarities:

- **Data Types:** Ensure that data types are appropriate for each column. Dates should be in datetime format, and numerical columns should have the correct data type.

- **Missing Values:** Check for missing values using methods like `isnull()` or `info()`. Decide on a strategy to handle missing data if needed.

- **Categorical Columns:** Identify and encode categorical columns appropriately. Some columns may have a finite set of categories, and using the `astype('category')` method can save memory.
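
The three considerations above can be sketched on a small, made-up sales table: converting a text date column with `pd.to_datetime`, counting missing values with `isnull().sum()`, and storing repeated labels as a categorical. The data below is illustrative only.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Camera'],
    'Price': [1200, 300, None, 700],   # one missing price on purpose
    'Date': ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'],
})

# Dates stored as text become proper datetime values.
df_sales['Date'] = pd.to_datetime(df_sales['Date'])

# Count missing values per column.
missing_counts = df_sales.isnull().sum()

# Repeated labels can be stored as a categorical column.
df_sales['Product'] = df_sales['Product'].astype('category')

print(df_sales.dtypes)
print(missing_counts)
```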

#### Common Mistakes:

- **Neglecting Missing Values:** Ignoring missing values during inspection can lead to incorrect analyses. Always check for missing data and decide how to handle it.

- **Not Understanding Data Types:** Misinterpreting data types may lead to errors in analysis. Make sure to understand the meaning and representation of each column's data type.

- **Overlooking Categorical Variables:** Categorical variables may not always be automatically identified. Check and convert categorical columns if needed, especially if they are nominal or ordinal.

Inspecting the DataFrame is a crucial step to understand the data's characteristics and make informed decisions during data analysis. Adapt the example code and considerations based on the specifics of your real-world datasets.


#### **`Indexing and Selecting Data in a DataFrame`**

**Introduction:**
Indexing and selecting data in a Pandas DataFrame are fundamental operations for extracting specific subsets of information. Two main methods for this purpose are `loc[]` and `iloc[]`. In this section, we'll explore these methods and provide examples of conditional indexing and boolean indexing.

**Using `loc[]` for Label-Based Indexing:**

1. **Selecting Rows by Label:**
   - Use `loc[]` to select rows based on their labels (index values):
     ```python
     import pandas as pd

     df = pd.read_csv('data.csv', index_col='ID')
     selected_row = df.loc[2]  # Select row with index 2
     ```

2. **Selecting Specific Columns for a Row:**
   - Specify both row label and column label to select a specific value:
     ```python
     specific_value = df.loc[2, 'Name']  # Select 'Name' for row with index 2
     ```

3. **Slicing Rows:**
   - Use slicing with labels to select a range of rows:
     ```python
     sliced_rows = df.loc[2:5]  # Select rows with indices 2 to 5 (inclusive)
     ```

4. **Selecting Rows and Columns Simultaneously:**
   - Use `loc[]` to select specific rows and columns:
     ```python
     selected_data = df.loc[2:5, ['Name', 'Age']]
     ```

**Using `iloc[]` for Position-Based Indexing:**

1. **Selecting Rows by Position:**
   - Use `iloc[]` to select rows based on their integer positions:
     ```python
     selected_row_position = df.iloc[1]  # Select the second row (position 1)
     ```

2. **Selecting Specific Columns for a Row by Position:**
   - Specify both row position and column position to select a specific value:
     ```python
     specific_value_position = df.iloc[1, 0]  # Select the first column for the second row
     ```

3. **Slicing Rows by Position:**
   - Use slicing with integer positions to select a range of rows:
     ```python
     sliced_rows_position = df.iloc[1:4]  # Select rows with positions 1 to 3
     ```

4. **Selecting Rows and Columns Simultaneously by Position:**
   - Use `iloc[]` to select specific rows and columns by position:
     ```python
     selected_data_position = df.iloc[1:4, [0, 1]]
     ```
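
The difference between labels and positions is easiest to see when the index holds integers that do not match the row positions. The sketch below uses a hypothetical index of 10, 20, 30, 40 and also shows that `loc[]` slices include the end label while `iloc[]` slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 22]},
    index=[10, 20, 30, 40],  # integer labels that differ from positions
)

by_label = df.loc[20, 'Name']     # row whose label is 20
by_position = df.iloc[1, 0]       # second row, first column

label_slice = df.loc[10:30]       # labels 10 through 30, end included
position_slice = df.iloc[0:2]     # positions 0 and 1, end excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```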

**Conditional Indexing and Boolean Indexing:**

1. **Conditional Indexing:**
   - Use boolean conditions to filter rows based on a specific criterion:
     ```python
     condition = df['Age'] > 30
     conditionally_selected = df.loc[condition]
     ```

2. **Boolean Indexing:**
   - Passing the boolean Series directly inside `df[...]` is a shorthand that returns the same rows as `df.loc[condition]`:
     ```python
     boolean_selected = df[df['Age'] > 30]
     ```
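
Conditions can also be combined. The sketch below, using the sample employee values from this module, shows `&` (and), `|` (or), and `~` (not); each individual condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
})

# Both conditions must hold.
older_and_well_paid = df[(df['Age'] >= 28) & (df['Salary'] > 55000)]

# Either condition may hold.
young_or_top = df[(df['Age'] < 25) | (df['Salary'] >= 75000)]

# Negation combined with isin(): everyone except the listed names.
not_in_list = df[~df['Name'].isin(['Alice', 'Bob'])]

print(older_and_well_paid)
print(young_or_top)
print(not_in_list)
```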

**Conclusion:**
Indexing and selecting data in a Pandas DataFrame using `loc[]` and `iloc[]` are powerful techniques. These methods allow you to retrieve specific rows and columns based on labels or positions. Additionally, conditional indexing and boolean indexing enable you to filter data efficiently based on specific criteria.


#### Example:


In [5]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)  # Set 'Name' column as the index

# Using loc[] for Label-Based Indexing
selected_row = df.loc['Bob']
specific_value = df.loc['Bob', 'Age']
sliced_rows = df.loc['Bob':'David']
selected_data = df.loc[['Bob', 'David'], ['Age', 'Salary']]

# Using iloc[] for Position-Based Indexing
selected_row_position = df.iloc[1]
specific_value_position = df.iloc[1, 0]
sliced_rows_position = df.iloc[1:4]
selected_data_position = df.iloc[1:4, [0, 1]]

# Conditional Indexing and Boolean Indexing
condition = df['Age'] > 30
conditionally_selected = df.loc[condition]
boolean_selected = df[df['Age'] > 30]

# Displaying the results
print("Using loc[] for Label-Based Indexing:")
print("Selected Row:\n", selected_row)
print("Specific Value:\n", specific_value)
print("Sliced Rows:\n", sliced_rows)
print("Selected Data:\n", selected_data)

print("\nUsing iloc[] for Position-Based Indexing:")
print("Selected Row by Position:\n", selected_row_position)
print("Specific Value by Position:\n", specific_value_position)
print("Sliced Rows by Position:\n", sliced_rows_position)
print("Selected Data by Position:\n", selected_data_position)

print("\nConditional Indexing and Boolean Indexing:")
print("Conditionally Selected:\n", conditionally_selected)
print("Boolean Selected:\n", boolean_selected)


Using loc[] for Label-Based Indexing:
Selected Row:
 Age              30
Salary        60000
Experience        5
Name: Bob, dtype: int64
Specific Value:
 30
Sliced Rows:
          Age  Salary  Experience
Name                            
Bob       30   60000           5
Charlie   35   75000           8
David     22   48000           2
Selected Data:
        Age  Salary
Name              
Bob     30   60000
David   22   48000

Using iloc[] for Position-Based Indexing:
Selected Row by Position:
 Age              30
Salary        60000
Experience        5
Name: Bob, dtype: int64
Specific Value by Position:
 30
Sliced Rows by Position:
          Age  Salary  Experience
Name                            
Bob       30   60000           5
Charlie   35   75000           8
David     22   48000           2
Selected Data by Position:
          Age  Salary
Name                
Bob       30   60000
Charlie   35   75000
David     22   48000

Conditional Indexing and Boolean Indexing:
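Conditionally Selected:
           Age  Salary  Experience
Name                            
Charlie   35   75000           8
Boolean Selected:
           Age  Salary  Experience
Name                            
Charlie   35   75000           8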

#### Real-world Scenario:
Imagine you have a dataset containing information about employees in a company, including their ID, name, department, salary, and performance ratings. You want to perform various operations to analyze and extract specific information about employees.

In [6]:
import pandas as pd

# Sample employee data
employee_data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Department': ['HR', 'IT', 'Sales', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 80000, 90000, 70000],
    'PerformanceRating': [8.5, 9.2, 7.8, 8.9, 9.5],
}

# Creating a DataFrame from the employee data
df_employees = pd.DataFrame(employee_data)

# Indexing and selecting data
selected_employee = df_employees.loc[df_employees['Name'] == 'Bob']
high_performance_employees = df_employees[df_employees['PerformanceRating'] > 9.0]
selected_columns = df_employees.loc[:, ['Name', 'Department', 'Salary']]

# Displaying the selected data
print("Selected Employee (Bob):\n", selected_employee)
print("\nHigh-Performance Employees:\n", high_performance_employees)
print("\nSelected Columns:\n", selected_columns)


Selected Employee (Bob):
    EmployeeID Name Department  Salary  PerformanceRating
1         102  Bob         IT   75000                9.2

High-Performance Employees:
    EmployeeID  Name Department  Salary  PerformanceRating
1         102   Bob         IT   75000                9.2
4         105  Emma  Marketing   70000                9.5

Selected Columns:
       Name Department  Salary
0    Alice         HR   60000
1      Bob         IT   75000
2  Charlie      Sales   80000
3    David    Finance   90000
4     Emma  Marketing   70000


#### Considerations or Peculiarities:

- **Indexing Choice:** Choose an appropriate column as the index based on your analysis needs. It could be a unique identifier like employee ID or another column that is relevant to your analysis.

- **Boolean Indexing:** Understand how to use boolean indexing effectively. It allows you to filter data based on conditions, as shown in the example with high-performance employees.

- **Column Selection:** Be mindful of the columns you select. If you only need specific columns, it's more efficient to select those rather than the entire DataFrame.

#### Common Mistakes:

- **Incorrect Syntax:** Incorrect use of square brackets, parentheses, or quotation marks in the indexing conditions can lead to errors. Always double-check syntax.

- **Using `==` for Float Comparison:** When comparing float values, be cautious due to potential precision issues. Using methods like `np.isclose()` is recommended for float comparisons.

- **Misunderstanding Boolean Indexing:** Developers may mistakenly think that boolean indexing is limited to exact matches, but it can be used for various conditions.
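
The float-comparison pitfall can be shown with a classic case: `0.1 + 0.2` is not exactly `0.3` in floating-point arithmetic. The sketch below compares an exact `==` filter with one based on `np.isclose()`; the column name `Value` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0.1 + 0.2, 0.3, 0.5]})

# Exact equality misses the first row because of rounding error.
exact_match = df[df['Value'] == 0.3]

# np.isclose() accepts values within a small tolerance.
close_match = df[np.isclose(df['Value'], 0.3)]

print(len(exact_match), len(close_match))
```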

Indexing and selecting data are crucial skills for extracting relevant information from a DataFrame. Adjust the example code and considerations based on the specifics of your real-world scenarios and datasets.



#### **`Descriptive Statistics in Pandas`**

**Introduction:**
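Descriptive statistics aim to summarize and describe the main features of a dataset. Pandas provides various functions to compute descriptive statistics for each column in a DataFrame. Most of these functions work on numeric data; in pandas 2.0 and later, calling them on a DataFrame that also holds text columns raises an error unless you pass `numeric_only=True` or select the numeric columns first.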

**1. Mean:**
   - **Definition:** The mean, also known as the average, is the sum of all values in a dataset divided by the number of observations.
   - **Pandas Code:**
     ```python
     mean_values = df.mean()
     ```
   - **Interpretation:** The mean provides a measure of central tendency, indicating the typical value in a dataset.
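
When a DataFrame mixes text and numeric columns, the statistics can be restricted to the numeric ones. A minimal sketch, assuming a small made-up frame with a `Name` column:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 60000],
})

# numeric_only=True skips the text column instead of failing on it.
mean_values = df.mean(numeric_only=True)
print(mean_values)
```

The same parameter is available on `median()`, `std()`, `var()`, `skew()`, `kurt()`, `corr()`, and `cov()`.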

**2. Median:**
   - **Definition:** The median is the middle value in a dataset when it is sorted in ascending order. It is less sensitive to extreme values than the mean.
   - **Pandas Code:**
     ```python
     median_values = df.median()
     ```
   - **Interpretation:** The median gives insight into the central position of the data, especially in the presence of outliers.

**3. Mode:**
   - **Definition:** The mode represents the most frequently occurring value(s) in a dataset.
   - **Pandas Code:**
     ```python
     mode_values = df.mode().iloc[0]
     ```
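   - **Interpretation:** Identifying the mode helps in understanding the most common values in a dataset. When several values tie, `mode()` returns all of them in sorted order, so `.iloc[0]` keeps only the smallest; if every value in a column is unique, that is simply the column minimum.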
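
How `mode()` handles ties is worth checking before relying on `.iloc[0]`. The sketch below contrasts a Series in which every value is unique with one that has a clear most frequent value; both Series are made up for the illustration.

```python
import pandas as pd

unique_ages = pd.Series([25, 30, 35, 22, 28])
repeated_ages = pd.Series([25, 30, 25, 22, 30, 25])

# Every value occurs once, so all of them tie and are returned in sorted order.
print(unique_ages.mode().tolist())

# Here 25 occurs most often, so it is the only mode.
print(repeated_ages.mode().tolist())
```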

**4. Standard Deviation:**
   - **Definition:** The standard deviation measures the amount of variation or dispersion in a set of values. A higher standard deviation indicates greater variability.
   - **Pandas Code:**
     ```python
     std_deviation = df.std()
     ```
   - **Interpretation:** Standard deviation is crucial for assessing the spread of values around the mean.

**5. Variance:**
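   - **Definition:** Variance is the average of the squared differences from the mean, and it equals the square of the standard deviation. By default, pandas computes the sample variance and sample standard deviation, dividing by `n - 1` rather than `n` (`ddof=1`).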
   - **Pandas Code:**
     ```python
     variance_values = df.var()
     ```
   - **Interpretation:** Variance provides another measure of data dispersion, useful in comparing the spread of different datasets.
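
The effect of the `ddof` parameter can be checked on the sample ages used in this module. This is a small sketch: `ddof=1` (the pandas default) divides by `n - 1`, while `ddof=0` divides by `n`.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 22, 28])

sample_var = ages.var()             # default ddof=1: divide by n - 1
population_var = ages.var(ddof=0)   # divide by n
sample_std = ages.std()             # square root of the sample variance

print(sample_var, population_var, sample_std)
```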

**6. Quantiles and Percentiles:**
   - **Definition:** Quantiles divide a dataset into intervals with equal probabilities. Percentiles are specific quantiles expressed as percentages.
   - **Pandas Code:**
     ```python
     quantiles = df.quantile([0.25, 0.5, 0.75])
     ```
   - **Interpretation:** Quantiles help in understanding the distribution and identifying central points in the data.

**7. Interquartile Range (IQR):**
   - **Definition:** IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It provides a measure of statistical dispersion.
   - **Pandas Code:**
     ```python
     iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]
     ```
   - **Interpretation:** IQR is useful for identifying potential outliers and understanding the bulk of the data distribution.
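
One common way to use the IQR for outlier detection is the 1.5 × IQR rule: values further than 1.5 IQRs below the first quartile or above the third quartile are flagged. The factor 1.5 is a convention rather than something pandas prescribes, and the salary figures below (including the extreme value) are made up for the sketch.

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 75000, 48000, 55000, 250000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 IQRs beyond the quartiles (a common convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print(outliers)
```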

**8. Skewness:**
   - **Definition:** Skewness measures the asymmetry of a distribution. Positive skewness indicates a right-skewed distribution, while negative skewness indicates a left-skewed distribution.
   - **Pandas Code:**
     ```python
     skewness_values = df.skew()
     ```
   - **Interpretation:** Skewness provides insights into the shape of the distribution.

**9. Kurtosis:**
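   - **Definition:** Kurtosis describes how heavy the tails of a distribution are relative to a normal distribution. `kurt()` returns excess kurtosis, so a normal distribution scores about 0 and larger values indicate heavier tails and more extreme outliers.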
   - **Pandas Code:**
     ```python
     kurtosis_values = df.kurt()
     ```
   - **Interpretation:** Kurtosis helps in understanding the tails' thickness and the presence of outliers.

**10. Correlation and Covariance:**
   - **Definition:** Correlation measures the linear relationship between two variables, while covariance measures their joint variability.
   - **Pandas Code:**
     ```python
     correlation_matrix = df.corr()
     covariance_matrix = df.cov()
     ```
   - **Interpretation:** Correlation and covariance are crucial for understanding relationships between variables.

**Conclusion:**
Descriptive statistics in Pandas provide a comprehensive view of the distribution, relationships, and variability within a dataset. Understanding these measures is fundamental for data analysis and decision-making. The choice of which statistics to use depends on the nature of the data and the questions you want to answer.

#### Example :


In [7]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Mean, Median, and Mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]

# Measures of Dispersion
std_deviation = df.std()
variance_values = df.var()

# Quantiles and Percentiles
quantiles = df.quantile([0.25, 0.5, 0.75])
iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]

# Summary Statistics
summary_stats = df.describe()

# Skewness and Kurtosis
skewness_values = df.skew()
kurtosis_values = df.kurt()

# Correlation and Covariance
correlation_matrix = df.corr()
covariance_matrix = df.cov()

# Displaying the results
print("Mean Values:\n", mean_values)
print("\nMedian Values:\n", median_values)
print("\nMode Values:\n", mode_values)
print("\nStandard Deviation:\n", std_deviation)
print("\nVariance Values:\n", variance_values)
print("\nQuantiles:\n", quantiles)
print("\nInterquartile Range (IQR):\n", iqr_values)
print("\nSummary Statistics:\n", summary_stats)
print("\nSkewness Values:\n", skewness_values)
print("\nKurtosis Values:\n", kurtosis_values)
print("\nCorrelation Matrix:\n", correlation_matrix)
print("\nCovariance Matrix:\n", covariance_matrix)


Mean Values:
 Age              28.0
Salary        57600.0
Experience        4.4
dtype: float64

Median Values:
 Age              28.0
Salary        55000.0
Experience        4.0
dtype: float64

Mode Values:
 Age              22
Salary        48000
Experience        2
Name: 0, dtype: int64

Standard Deviation:
 Age               4.949747
Salary        10784.247772
Experience        2.302173
dtype: float64

Variance Values:
 Age                  24.5
Salary        116300000.0
Experience            5.3
dtype: float64

Quantiles:
        Age   Salary  Experience
0.25  25.0  50000.0         3.0
0.50  28.0  55000.0         4.0
0.75  30.0  60000.0         5.0

Interquartile Range (IQR):
 Age               5.0
Salary        10000.0
Experience        2.0
dtype: float64

Summary Statistics:
              Age        Salary  Experience
count   5.000000      5.000000    5.000000
mean   28.000000  57600.000000    4.400000
std     4.949747  10784.247772    2.302173
min    22.000000  48000.000000    2

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about the performance of students in an educational institution. The dataset includes student IDs, exam scores in different subjects, attendance percentages, and participation in extracurricular activities. You want to extract descriptive statistics to gain insights into the students' academic performance.

In [8]:
import pandas as pd

# Sample student performance data
student_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, 90, 78, 92, 88],
    'English_Score': [75, 85, 80, 88, 92],
    'Attendance_Percentage': [92, 95, 88, 97, 93],
    'Extracurricular_Participation': [2, 3, 1, 4, 2],
}

# Creating a DataFrame from the student data
df_students = pd.DataFrame(student_data)

# Extracting Descriptive Statistics
# Note: StudentID is an identifier, so its mean, median, std, and correlations carry no meaning
mean_scores = df_students.mean()
median_scores = df_students.median()
std_deviation_scores = df_students.std()
attendance_summary = df_students['Attendance_Percentage'].describe()
correlation_matrix = df_students.corr()

# Displaying the Descriptive Statistics
print("Mean Scores:\n", mean_scores)
print("\nMedian Scores:\n", median_scores)
print("\nStandard Deviation of Scores:\n", std_deviation_scores)
print("\nAttendance Summary:\n", attendance_summary)
print("\nCorrelation Matrix:\n", correlation_matrix)


Mean Scores:
 StudentID                         3.0
Math_Score                       86.6
English_Score                    84.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.4
dtype: float64

Median Scores:
 StudentID                         3.0
Math_Score                       88.0
English_Score                    85.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.0
dtype: float64

Standard Deviation of Scores:
 StudentID                        1.581139
Math_Score                       5.458938
English_Score                    6.670832
Attendance_Percentage            3.391165
Extracurricular_Participation    1.140175
dtype: float64

Attendance Summary:
 count     5.000000
mean     93.000000
std       3.391165
min      88.000000
25%      92.000000
50%      93.000000
75%      95.000000
max      97.000000
Name: Attendance_Percentage, dtype: float64

Correlation Matrix:
                                StudentID  Math_Score  English_

#### Considerations or Peculiarities:

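- **Data Types:** Ensure that numeric columns are stored as integers or floats rather than as text. Statistics such as the mean and correlation are defined only for numeric data; in recent Pandas versions, calling `df.mean()` or `df.corr()` on a DataFrame that contains text columns raises an error unless you pass `numeric_only=True` or select the numeric columns first.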

- **Missing Values:** Descriptive statistics functions in Pandas automatically exclude missing values. Be aware of missing data and decide how to handle it.

- **Correlation vs. Causation:** Correlation does not imply causation. When interpreting correlation values, be cautious about inferring causal relationships between variables.
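
The data-type point can be checked before computing anything. The sketch below assumes a hypothetical DataFrame, `df_mixed`, in which `Score` was read as text. `pd.to_numeric` with `errors='coerce'` converts what it can and turns the rest into `NaN`, and `numeric_only=True` restricts a statistic to the numeric columns.

```python
import pandas as pd

# Hypothetical data: 'Score' arrived as strings, one of them not a number
df_mixed = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': ['85', '90', 'absent'],
    'Age': [25, 30, 35],
})

print(df_mixed.dtypes)  # 'Score' is not numeric yet

# Convert to numbers; values that cannot be parsed become NaN
df_mixed['Score'] = pd.to_numeric(df_mixed['Score'], errors='coerce')

# Restrict the statistic to numeric columns (skips 'Name')
numeric_means = df_mixed.mean(numeric_only=True)
print(numeric_means)
```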

#### Common Mistakes:

- **Misinterpreting Correlation:** A common mistake is assuming a strong correlation implies a cause-and-effect relationship. Always consider the context and potential confounding variables.

- **Ignoring Missing Values:** Because Pandas silently skips missing values, a statistic can be based on fewer observations than you expect. Check `count()` first, then decide whether `dropna()` or `fillna()` is appropriate.

- **Inconsistent Data Types:** Ensure that all numeric columns have consistent data types. Mixed data types in a numeric column may cause unexpected results.
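
To see what skipping missing values means in practice, the sketch below compares the default behaviour with `skipna=False` and uses `count()` to show how many observations the statistic is actually based on. The `scores` Series is hypothetical.

```python
import numpy as np
import pandas as pd

scores = pd.Series([80.0, np.nan, 90.0, 100.0])

mean_default = scores.mean()             # NaN is skipped: (80 + 90 + 100) / 3
mean_strict = scores.mean(skipna=False)  # NaN propagates to the result
n_observed = scores.count()              # number of non-missing values

print(mean_default, mean_strict, n_observed)
```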

Descriptive statistics offer valuable insights into the central tendency, variability, and relationships within a dataset. Consider the specific characteristics of your data when selecting which statistics to compute and interpret the results in the context of your analysis.


### **`6. Data Cleaning and Handling Missing Data`**

#### **`Handling Missing Values`**

#### Importance of Identifying and Handling Missing Values in a DataFrame:

Missing values, usually represented as `NaN` (Not a Number) in Pandas (`None`, `NaT`, and `pd.NA` are also treated as missing), are a common occurrence in real-world datasets. Properly identifying and handling missing values is crucial for meaningful and accurate data analysis. Ignoring missing values can lead to biased results and incorrect interpretations. Here's why handling missing values is important:

1. **Data Accuracy:** Missing values can distort summary statistics, such as mean and standard deviation, leading to inaccurate insights about the dataset.

2. **Model Performance:** If missing values are not addressed, they can adversely impact machine learning models, causing biased predictions and reduced model performance.

3. **Data Visualization:** Visualizations may not accurately represent the distribution of data when missing values are present, affecting the interpretation of results.

4. **Statistical Analyses:** Many statistical analyses and tests assume complete data. Missing values can compromise the validity of statistical results and significance testing.

#### Methods for Handling Missing Values:

1. **Identifying Missing Values:**
   - **`isna()` and `notna()`:**
     ```python
     # Check for missing values
     df.isna()  # Returns a DataFrame of the same shape with True for missing values
     df.notna()  # Returns the opposite of isna()
     ```

2. **Handling Missing Values:**
   - **`fillna()`:**
     ```python
     # Fill missing values with a specified value or a calculated value (returns a new DataFrame)
     df.fillna(0)  # Fill with a constant value (here 0)
     df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their column means
     ```

   - **Dropping Missing Values:**
     ```python
     # Drop rows or columns containing missing values
     df.dropna()  # Drop rows with any missing values
     df.dropna(axis=1)  # Drop columns with any missing values
     ```

   - **Interpolation:**
     ```python
     # Interpolate missing values using various methods (linear, polynomial, etc.)
     df.interpolate()
     ```
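
Because `isna()` returns a full Boolean DataFrame, it is usually summarised rather than read cell by cell. A minimal sketch on a small hypothetical DataFrame (`df_demo` and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [10.0, 20.0, np.nan, 40.0],
    'C': ['x', 'y', 'z', 'w'],
})

missing_per_column = df_demo.isna().sum()    # count of missing values per column
missing_share = df_demo.isna().mean()        # fraction of missing values per column
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```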

#### Considerations and Best Practices:

- **Context Matters:** The method chosen to handle missing values depends on the nature of the data and the reason for missingness. Consider the context before applying a specific strategy.

- **Impact on Analysis:** Understand how the chosen method might impact your analysis. For example, filling missing values with the mean could introduce bias if missingness is not random.

- **Visualization:** Visualize the distribution of missing values using tools like heatmaps to better understand patterns of missingness.

- **Documentation:** Clearly document the chosen strategy for handling missing values in your analysis to ensure transparency and reproducibility.
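
The bias mentioned under *Impact on Analysis* can be made concrete. In the sketch below (hypothetical values), filling with the mean leaves the column mean unchanged but shrinks the standard deviation, because the filled values sit exactly at the centre of the distribution.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])
filled = values.fillna(values.mean())

print("mean before/after:", values.mean(), filled.mean())
print("std  before/after:", values.std(), filled.std())
```

Any analysis that relies on the spread of the data (confidence intervals, outlier detection) is affected by this shrinkage, which is one reason to document the imputation method.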

#### Conclusion:

Properly handling missing values is a critical step in the data cleaning process. It ensures the integrity of analyses and models, leading to more reliable and accurate results. Familiarizing yourself with Pandas methods like `isna()`, `notna()`, and `fillna()` empowers you to make informed decisions when dealing with missing data in your DataFrame.

#### Example:
Consider a scenario where we have a DataFrame containing information about students' exam scores in different subjects. The dataset has missing values that need to be handled, and we'll demonstrate the use of `isna()`, `notna()`, and `fillna()` to address these missing values.

In [9]:
import pandas as pd
import numpy as np

# Sample student exam data with missing values
exam_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, np.nan, 78, 92, 88],
    'English_Score': [75, 85, np.nan, 88, 92],
    'Physics_Score': [90, 78, 85, np.nan, 94],
    'Chemistry_Score': [82, 88, 90, 76, np.nan],
}

# Creating a DataFrame from the exam data
df_exams = pd.DataFrame(exam_data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df_exams)

# Identifying missing values
missing_values = df_exams.isna()
print("\nMissing Values:")
print(missing_values)

# Filling missing values with the mean of each column
mean_filled_df = df_exams.fillna(df_exams.mean())

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Filling Missing Values with Mean:")
print(mean_filled_df)


Original DataFrame:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0          1        85.0           75.0           90.0             82.0
1          2         NaN           85.0           78.0             88.0
2          3        78.0            NaN           85.0             90.0
3          4        92.0           88.0            NaN             76.0
4          5        88.0           92.0           94.0              NaN

Missing Values:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0      False       False          False          False            False
1      False        True          False          False            False
2      False       False           True          False            False
3      False       False          False           True            False
4      False       False          False          False             True

DataFrame after Filling Missing Values with Mean:
   StudentID  Math_Score  English_Score  Physics

In the above example:

1. **Identifying Missing Values:**
   - We use `isna()` to create a DataFrame of the same shape as the original, with `True` values where missing values are present.

2. **Handling Missing Values:**
   - We use `fillna()` to fill missing values with the mean of each column.

3. **Result:**
   - The final DataFrame (`mean_filled_df`) has missing values filled with the mean of each respective column.

This example showcases the importance of identifying and handling missing values and demonstrates a practical approach using Pandas methods. Adjust the code based on your specific dataset and requirements.

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about customer orders in an e-commerce platform. The dataset includes order IDs, product names, quantities, prices, and shipping dates. Due to various reasons such as system glitches or customer actions, some data is missing. Let's explore how to identify and handle missing values in this context.

In [10]:
import pandas as pd
import numpy as np

# Sample e-commerce order data with missing values
order_data = {
    'OrderID': [101, 102, np.nan, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', np.nan, 'Camera'],
    'Quantity': [2, 1, np.nan, 2, 1],
    'Price': [1200, 800, 300, np.nan, 700],
    'Shipping_Date': ['2022-01-01', '2022-01-02', np.nan, '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying missing values
missing_values = df_orders.isna()
print("\nMissing Values:")
print(missing_values)

# Handling missing values: drop rows with a missing OrderID, then fill Price and Shipping_Date
# Note: the missing Product in row 3 is not filled by this step
df_orders_cleaned = (
    df_orders
    .dropna(subset=['OrderID'])
    .fillna({'Price': df_orders['Price'].mean(), 'Shipping_Date': '2022-01-01'})
)

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Handling Missing Values:")
print(df_orders_cleaned)


Original Order DataFrame:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
2      NaN      Tablet       NaN   300.0           NaN
3    104.0         NaN       2.0     NaN    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-03

Missing Values:
   OrderID  Product  Quantity  Price  Shipping_Date
0    False    False     False  False          False
1    False    False     False  False          False
2     True    False      True  False           True
3    False     True     False   True          False
4    False    False     False  False          False

DataFrame after Handling Missing Values:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
3    104.0         NaN       2.0   750.0    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-0

#### Considerations or Peculiarities:

- **Reasons for Missingness:**
  - Understand the reasons for missing values. In this example, missing OrderID might be due to a system error, missing Product might be due to a new product without details, and missing Price and Shipping_Date might be due to incomplete data.

- **Impact on Analysis:**
  - Consider how missing values might impact your analysis. Dropping rows or filling missing values should align with the analysis goals.

- **Domain Knowledge:**
  - Domain knowledge is crucial for deciding how to handle missing values appropriately. For example, filling a missing Price with the mean might not be suitable if prices vary significantly.
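
The domain-knowledge point suggests treating each column on its own terms. The sketch below is one possible policy, not a recommendation. It assumes that an order without an `OrderID` is unusable, that an unknown product can be labelled `'Unknown'`, and that the median is an acceptable stand-in for a missing price. The data mirrors the example above, without the `Shipping_Date` column.

```python
import numpy as np
import pandas as pd

df_orders = pd.DataFrame({
    'OrderID': [101, 102, np.nan, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', np.nan, 'Camera'],
    'Quantity': [2, 1, np.nan, 2, 1],
    'Price': [1200, 800, 300, np.nan, 700],
})

# 1. Rows without an identifier are dropped (assumed unusable)
df_clean = df_orders.dropna(subset=['OrderID'])

# 2. Column-specific fills, computed on the rows that remain
df_clean = df_clean.fillna({
    'Product': 'Unknown',
    'Price': df_clean['Price'].median(),
})

print(df_clean)
```

Computing the median after dropping the unusable row keeps that row's price from influencing the fill value.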

#### Common Mistakes:

- **Ignoring Missing Values:**
  - Ignoring missing values without assessing their impact on analyses can lead to biased results.

- **Unintended Dropping:**
  - Unintentionally dropping rows or columns without considering the reasons for missingness may result in data loss and incomplete analyses.

- **Inconsistent Handling:**
  - Inconsistently handling missing values across different columns or datasets can introduce inconsistencies in your analysis.

Handling missing values requires careful consideration and should be aligned with the overall data analysis goals. It's essential to understand the dataset's context and choose appropriate strategies based on the nature of the missing data.


#### **`Dropping or Filling Missing Values`**

#### Decision-Making Process:

1. **Dropping Missing Values:**
   - **Context:** Dropping missing values is suitable when the missingness is random, and removing incomplete records doesn't introduce bias or impact the analysis significantly. It's a pragmatic approach when the missing data is negligible compared to the dataset size.

   - **Example:**
     ```python
     # Drop rows with any missing values
     df_dropped = df.dropna()
     ```

2. **Filling Missing Values:**
   - **Context:** Filling missing values is appropriate when retaining the incomplete records is crucial, and a reasonable estimation can be made for the missing values. This is common when dealing with time-series data, where continuity matters.

   - **Example:**
     ```python
     # Fill every missing value in the DataFrame with a constant value
     df_filled_constant = df.fillna(value=0)
     ```

   - **Example:**
     ```python
     # Fill missing values in numeric columns with the mean of each column
     df_filled_mean = df.fillna(df.mean(numeric_only=True))
     ```

   - **Example:**
     ```python
     # Forward fill missing values in a DataFrame
     df_forward_filled = df.ffill()
     ```

   - **Example:**
     ```python
     # Backward fill missing values in a DataFrame
     df_backward_filled = df.bfill()
     ```

   - **Example:**
     ```python
     # Interpolate missing values using linear interpolation
     df_interpolated_linear = df.interpolate(method='linear')
     ```
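
Dropping does not have to be all-or-nothing. The `subset`, `thresh`, and `how` parameters of `dropna()` control which rows count as too incomplete. A minimal sketch on hypothetical data (`df_demo` and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'x': [1.0, np.nan, np.nan, 4.0],
    'y': [np.nan, 2.0, np.nan, 8.0],
})

# Keep only rows with no missing values at all
drop_any = df_demo.dropna()

# Only look at column 'x' when deciding what to drop
drop_if_x_missing = df_demo.dropna(subset=['x'])

# Keep rows that have at least 2 non-missing values
keep_two_or_more = df_demo.dropna(thresh=2)

# Drop a row only if 'x' and 'y' are both missing
drop_all_missing = df_demo.dropna(subset=['x', 'y'], how='all')

print(len(drop_any), len(drop_if_x_missing), len(keep_two_or_more), len(drop_all_missing))
```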

#### Considerations:

- **Data Nature:**
  - Consider the nature of the data. For ordered data such as time series, forward fill, backward fill, or interpolation might be suitable; for unordered numeric data, the mean or median is a common choice.

- **Impact on Analysis:**
  - Evaluate how the chosen method for handling missing values might impact subsequent analyses. Ensure that the imputation method aligns with the overall analysis goals.

- **Domain Knowledge:**
  - Leverage domain knowledge to make informed decisions. Some missing values may be inherently unfillable due to the nature of the data.

#### Conclusion:

The decision between dropping or filling missing values depends on the specific characteristics of the data and the analysis goals. Dropping values is a straightforward approach but may lead to data loss. Filling values is a more nuanced process, requiring careful consideration of the data's nature and the impact on downstream analyses. Experiment with different strategies and choose the one that best fits the context of your dataset.

#### Example:

Let's consider a scenario where we have a DataFrame representing monthly sales data for a product. The dataset has missing values in the 'Sales' column, and we need to decide whether to drop or fill those missing values based on the context.

In [11]:
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [100, 120, np.nan, 150, np.nan, 180],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Sales DataFrame:")
print(df_sales)

# Decision 1: Dropping Missing Values
df_dropped = df_sales.dropna()

# Displaying the DataFrame after dropping missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

# Decision 2: Filling Missing Values with Forward Fill
df_filled_forward = df_sales.ffill()

# Displaying the DataFrame after forward filling missing values
print("\nDataFrame after Forward Filling Missing Values:")
print(df_filled_forward)

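# Decision 3: Filling Missing Values with Mean
# Target the 'Sales' column explicitly so the scalar fill cannot leak into other columns
df_filled_mean = df_sales.fillna({'Sales': df_sales['Sales'].mean()})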

# Displaying the DataFrame after filling missing values with mean
print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)


Original Sales DataFrame:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar    NaN
3   Apr  150.0
4   May    NaN
5   Jun  180.0

DataFrame after Dropping Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
3   Apr  150.0
5   Jun  180.0

DataFrame after Forward Filling Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  120.0
3   Apr  150.0
4   May  150.0
5   Jun  180.0

DataFrame after Filling Missing Values with Mean:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  137.5
3   Apr  150.0
4   May  137.5
5   Jun  180.0


In this example:

1. **Dropping Missing Values:**
   - We use `dropna()` to remove rows with any missing values. This might be suitable if missing values are limited and their removal doesn't significantly affect the analysis.

2. **Filling Missing Values with Forward Fill:**
   - We use `ffill()` to fill missing values with the previous month's sales. This approach is reasonable when values change slowly from one period to the next, so the last observed value is a fair estimate of the missing one.

3. **Filling Missing Values with Mean:**
   - We use `fillna()` with the mean of the 'Sales' column to impute missing values. This approach is suitable when we want to retain all rows and fill missing values with a representative value.

Adjust the code based on the specific characteristics of your dataset and the analysis goals. Choosing between dropping or filling missing values should be driven by the dataset's context and the impact on subsequent analyses.
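
For a series with a trend, linear interpolation is a further option alongside the three decisions above: it places each missing month on the straight line between its neighbours. A sketch using the same sales figures (the `df_interpolated` name is an assumption for illustration):

```python
import numpy as np
import pandas as pd

df_sales = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [100, 120, np.nan, 150, np.nan, 180],
})

# Interpolate only the numeric column; 'Month' is left untouched
df_interpolated = df_sales.copy()
df_interpolated['Sales'] = df_interpolated['Sales'].interpolate(method='linear')

print(df_interpolated)
```

Here March falls halfway between February and April, and May halfway between April and June, which follows the upward trend more closely than a single mean value does.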

#### Real-world Scenario:

Imagine you are managing a dataset that tracks monthly sales data for a retail business. The dataset includes information such as the month, product category, sales quantity, and revenue. However, due to occasional reporting errors or data collection issues, there are missing values in the dataset. Let's explore how to handle these missing values using Pandas.

In [12]:
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Category': ['Electronics', 'Clothing', np.nan, 'Electronics', np.nan, 'Clothing'],
    'Sales_Quantity': [120, 150, np.nan, 200, np.nan, 180],
    'Revenue': [12000, np.nan, 18000, np.nan, 25000, 22000],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Monthly Sales DataFrame:")
print(df_sales)

# Handling Missing Values:
# Decision 1: Dropping rows with any missing values
df_dropped = df_sales.dropna()

# Decision 2: Filling missing values with mean for numerical columns
df_filled_mean = df_sales.fillna({
    'Sales_Quantity': df_sales['Sales_Quantity'].mean(),
    'Revenue': df_sales['Revenue'].mean(),
})

# Displaying the DataFrames after handling missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)


Original Monthly Sales DataFrame:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0      NaN
2   Mar          NaN             NaN  18000.0
3   Apr  Electronics           200.0      NaN
4   May          NaN             NaN  25000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Dropping Missing Values:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Filling Missing Values with Mean:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0  19250.0
2   Mar          NaN           162.5  18000.0
3   Apr  Electronics           200.0  19250.0
4   May          NaN           162.5  25000.0
5   Jun     Clothing           180.0  22000.0


#### Considerations or Peculiarities:

- **Imputation Strategy:**
  - Choosing between dropping and filling depends on the impact on analysis. Dropping may lead to loss of important information, while filling may introduce bias if not done carefully.

- **Context of Data:**
  - Understand the context of your data. For example, filling missing revenue values with the mean might be reasonable, but for product categories, it may not make sense.

- **Column-specific Strategies:**
  - Different columns may require different strategies. For numeric columns, mean or median filling could be appropriate, while for categorical columns, forward fill or mode filling might be more suitable.
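
Column-specific strategies can be expressed in a single `fillna()` call by passing a dictionary. The sketch below uses hypothetical data and assumes the median suits the numeric column and the most frequent value suits the categorical one.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', np.nan, 'Electronics', 'Electronics'],
    'Revenue': [12000, np.nan, 18000, 90000, 22000],
})

fill_values = {
    'Category': df_demo['Category'].mode()[0],  # most frequent category
    'Revenue': df_demo['Revenue'].median(),     # less sensitive to the 90000 outlier than the mean
}
df_filled = df_demo.fillna(fill_values)

print(df_filled)
```

When two categories tie for most frequent, `mode()` returns both in sorted order, so `[0]` picks the alphabetically first one. In that situation a mode fill is an arbitrary choice and worth reconsidering.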

#### Common Mistakes:

- **Unintended Data Loss:**
  - Developers might drop rows without considering the impact on the dataset's integrity. This can lead to unintended data loss, especially if the missing values are not randomly distributed.

- **Inconsistent Imputation:**
  - Filling missing values inconsistently across columns or datasets can introduce inconsistencies in the dataset.

- **Overlooking Context:**
  - Filling missing values without understanding the context of the data and the reasons for missingness may lead to inaccurate imputations.

Handling missing values is a critical aspect of data preprocessing. It requires thoughtful consideration of the dataset's context, the nature of missingness, and the impact on downstream analyses. Developers should choose strategies that align with the goals of their analysis and avoid common pitfalls that can compromise data quality.

#### **`Removing Duplicates in a DataFrame`**

#### Significance of Identifying and Removing Duplicate Rows:

**1. Data Accuracy:**
   - Duplicate rows can distort analyses by inflating counts, averages, or other summary statistics. Removing duplicates ensures the accuracy of calculated metrics.

**2. Consistent Results:**
   - Duplicates can lead to inconsistencies in results, especially in scenarios where aggregated data or distinct counts are essential.

**3. Efficient Memory Usage:**
   - Datasets with duplicate rows consume more memory. Eliminating duplicates optimizes memory usage and enhances computational efficiency.

**4. Meaningful Insights:**
   - Duplicate rows may not contribute meaningful insights but can skew results. Removing them ensures a cleaner dataset for analysis.

#### Examples of Removing Duplicates:

**1. Identifying Duplicate Rows:**
```python
# Check for duplicate rows based on all columns
duplicates = df.duplicated()

# Check for duplicate rows based on specific columns
duplicates_specific_columns = df.duplicated(subset=['Column1', 'Column2'])
```

**2. Removing Duplicate Rows:**
```python
# Remove all duplicate rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()

# Remove duplicate rows based on specific columns, keeping the first occurrence
df_no_duplicates_specific_columns = df.drop_duplicates(subset=['Column1', 'Column2'])
```

#### Considerations:

- **Column Selection:**
  - Consider the columns relevant to duplicate identification. In some cases, duplicates may only be duplicates when considering specific columns.

- **Order Matters:**
  - By default, `drop_duplicates()` retains the first occurrence and removes subsequent duplicates; pass `keep='last'` or `keep=False` to change this. Ensure the row order aligns with your analysis goals.

- **In-Place vs. New DataFrame:**
  - Decide whether to modify the existing DataFrame in-place or create a new one. Choose based on the need to retain the original data.
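
The effect of the `keep` parameter is easiest to see side by side. A minimal sketch on hypothetical data, showing the default (`'first'`), `'last'`, and `False` (drop every row that has a duplicate):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Column1': ['a', 'b', 'a', 'c'],
    'Column2': [1, 2, 1, 3],
})

kept_first = df_demo.drop_duplicates()             # keeps index 0, drops index 2
kept_last = df_demo.drop_duplicates(keep='last')   # keeps index 2, drops index 0
kept_none = df_demo.drop_duplicates(keep=False)    # drops both index 0 and index 2

print(list(kept_first.index), list(kept_last.index), list(kept_none.index))
```

All three calls return a new DataFrame and leave `df_demo` unchanged. With `inplace=True` the method modifies the DataFrame and returns `None`, so its result should not be assigned back to a variable.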

#### Common Mistakes:

- **Ignoring Specific Columns:**
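  - Without `subset`, rows are compared across all columns, so rows that repeat only in the key columns (for example, the same `OrderID` with a different quantity) are not flagged and stay in the data. Conversely, a `subset` that is too narrow can remove rows that are legitimately distinct.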

- **Overlooking Order:**
  - If retaining the first occurrence is essential, ensure that the DataFrame is sorted appropriately before using `drop_duplicates()`.

- **Inconsistent Usage:**
  - Inconsistently applying duplicate removal across different datasets or analyses can lead to inconsistent results.

#### Conclusion:

Identifying and removing duplicate rows is a crucial step in data cleaning and preprocessing. It enhances the accuracy of analyses, ensures meaningful insights, and optimizes memory usage. Developers should carefully consider the columns involved, the order of removal, and whether to modify the DataFrame in-place when handling duplicates.

#### Example:

Let's consider a scenario where you have a DataFrame containing data on customer orders, and due to data entry errors or system glitches, there are duplicate entries. We'll explore how to identify and remove these duplicate rows using Pandas.

In [13]:
import pandas as pd

# Sample order data with duplicate entries
order_data = {
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying Duplicate Rows
duplicates = df_orders.duplicated()

# Displaying duplicate rows
print("\nDuplicate Rows:")
print(df_orders[duplicates])

# Removing Duplicate Rows
df_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after removing duplicates
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)


Original Order DataFrame:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700
5      102  Smartphone         1          800

Duplicate Rows:
   OrderID     Product  Quantity  Total_Price
5      102  Smartphone         1          800

DataFrame after Removing Duplicates:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700


In this example:

1. **Identifying Duplicate Rows:**
   - We use `duplicated()` to identify duplicate rows based on all columns. The result is a boolean series indicating which rows are duplicates.

2. **Displaying Duplicate Rows:**
   - We use boolean indexing to display the rows that are identified as duplicates.

3. **Removing Duplicate Rows:**
   - We use `drop_duplicates()` to remove duplicate rows, keeping the first occurrence of each unique row.

4. **Displaying Result:**
   - We display the DataFrame after removing duplicates to see the cleaned dataset.

Adjust the code based on your specific dataset and analysis goals. Understanding the significance of removing duplicates and applying these methods ensures a cleaner and more reliable dataset for further analysis.
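
Note that row 2 (a second `OrderID` 101) survives above because its `Quantity` differs from row 0, so the two rows are not identical across all columns. If an `OrderID` is meant to be unique, which is an assumption about this dataset, the check can be restricted to that column:

```python
import pandas as pd

df_orders = pd.DataFrame({
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
})

# keep=False marks every row whose OrderID appears more than once
repeated_ids = df_orders[df_orders.duplicated(subset=['OrderID'], keep=False)]
print(repeated_ids)

# Keep only the first row for each OrderID
df_unique_orders = df_orders.drop_duplicates(subset=['OrderID'])
print(df_unique_orders)
```

Inspecting `repeated_ids` before dropping anything helps decide whether the conflicting rows are data-entry errors or genuinely different orders.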

#### Considerations or Peculiarities:

- **Column Selection:**
  - Consider which columns should be considered for identifying duplicates. In some cases, duplicates may only be duplicates when considering specific columns.

- **Impact on Analysis:**
  - Consider how duplicate rows might impact subsequent analyses. Retaining duplicates might skew results, while removing them ensures a cleaner dataset.

#### Common Mistakes:

- **Incomplete Duplicate Identification:**
  - Choosing the wrong columns for the comparison can leave true duplicates in place (when too many columns are compared) or remove distinct records (when too few are compared).

- **Ignoring Context:**
  - Failing to understand the context of the data might lead to unintended removal of rows that look identical but are legitimate repeated records, such as two separate purchases of the same item.

- **Overlooking Order:**
  - Forgetting to sort the DataFrame appropriately before using `drop_duplicates()` may lead to unexpected results if order matters.
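
When the most recent record should win, sorting first makes the choice explicit. The sketch below assumes a hypothetical `Updated` column holding the date of each record:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'OrderID': [101, 102, 101],
    'Quantity': [2, 1, 3],
    'Updated': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-05']),
})

# Sort so the newest record per OrderID comes last, then keep that one
df_latest = (
    df_demo.sort_values('Updated')
           .drop_duplicates(subset=['OrderID'], keep='last')
)

print(df_latest)
```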

Handling duplicate rows is essential for maintaining data accuracy and ensuring meaningful analyses. Developers should carefully choose columns for duplicate identification, understand the impact of duplicates on analysis, and avoid common mistakes that could compromise data integrity.

### **`Hands On Experience:`**


### Question 1: Creating a DataFrame from Lists and Basic Operations

#### Scenario:
You have information about monthly sales for a retail store. Each list holds one field (month name, sales, or expenses) for the same three months.

```python
# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

# Question:
# Create a DataFrame named 'df_sales' from these lists, and display the DataFrame.
# Calculate the profit for each month (Profit = Sales - Expenses).
# Display the DataFrame after adding the 'Profit' column.
```

In [14]:
# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

import pandas as pd

# Creating a DataFrame from Lists
df_sales = pd.DataFrame({'Month': months, 'Sales': sales, 'Expenses': expenses})

# Calculating Profit
df_sales['Profit'] = df_sales['Sales'] - df_sales['Expenses']

# Displaying the DataFrame
print("DataFrame after Creating and Calculating Profit:")
print(df_sales)

DataFrame after Creating and Calculating Profit:
  Month  Sales  Expenses  Profit
0   Jan   1200       800     400
1   Feb   1500       900     600
2   Mar   1800      1000     800


### Question 2: Reading Data from CSV and Descriptive Statistics

#### Scenario:

Let's assume you have a CSV file named 'sales_data.csv' with the following structure:

```csv
Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
```

The file contains information about product sales, with some values missing. Read the data into a DataFrame and compute descriptive statistics.
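
If the file is not at hand, the scenario can be reproduced in memory. The sketch below is only a convenience for following along: the names `csv_text` and `df_preview` are introduced here and are not part of the exercise. It also shows how the empty fields are interpreted when the data is parsed.

```python
import io
import pandas as pd

# The same four rows shown above, held in memory instead of on disk.
csv_text = """Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
df_preview = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, which makes both numeric columns float64.
print(df_preview.dtypes)

# Count the missing values per column.
print(df_preview.isna().sum())
```

To create the actual file for the exercise, the same text could be written to disk with `df_preview.to_csv('sales_data.csv', index=False)`.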

```python
# Question:
# Read 'sales_data.csv' into a DataFrame named 'df_sales'.
# Display the first 5 rows of the DataFrame.
# Calculate basic descriptive statistics for the 'Quantity' column.
```



```python
import pandas as pd

# Reading Data from CSV
df_sales = pd.read_csv('sales_data.csv')

# Displaying the first 5 rows
print("First 5 Rows of df_sales:")
print(df_sales.head())

# Descriptive Statistics for 'Quantity'
quantity_stats = df_sales['Quantity'].describe()
print("\nDescriptive Statistics for 'Quantity':")
print(quantity_stats)
```

```
First 5 Rows of df_sales:
      Product  Quantity  Revenue
0      Laptop      10.0  12000.0
1  Smartphone       5.0   8000.0
2      Tablet       NaN   4500.0
3      Camera       3.0      NaN

Descriptive Statistics for 'Quantity':
count     3.000000
mean      6.000000
std       3.605551
min       3.000000
25%       4.000000
50%       5.000000
75%       7.500000
max      10.000000
Name: Quantity, dtype: float64
```


### Question 3: Handling Missing Values and Filling with Mean

#### Scenario:
The `df_sales` DataFrame from Question 2 has a missing value in the 'Revenue' column. Handle it by filling it with the mean of the known revenues.

```python
# Question:
# Handle missing values in the 'Revenue' column by filling them with the mean.
# Display the DataFrame after handling missing values.
```

```python
# Handling Missing Values in 'Revenue'
# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column, which recent pandas versions warn about.
df_sales['Revenue'] = df_sales['Revenue'].fillna(df_sales['Revenue'].mean())

# Displaying the DataFrame after Handling Missing Values
print("DataFrame after Handling Missing Values in 'Revenue':")
print(df_sales)
```

```
DataFrame after Handling Missing Values in 'Revenue':
      Product  Quantity       Revenue
0      Laptop      10.0  12000.000000
1  Smartphone       5.0   8000.000000
2      Tablet       NaN   4500.000000
3      Camera       3.0   8166.666667
```
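
The value filled in for Camera is the mean of the three known revenues, (12000 + 8000 + 4500) / 3 ≈ 8166.67. As an alternative sketch, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, which leaves the original DataFrame untouched. The DataFrame `df_demo` below is a stand-in rebuilt from the CSV contents shown in Question 2; the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for df_sales, rebuilt from the CSV contents in Question 2.
df_demo = pd.DataFrame({
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Camera'],
    'Quantity': [10, 5, np.nan, 3],
    'Revenue': [12000, 8000, 4500, np.nan],
})

# mean() skips NaN by default, so only the three known revenues are averaged.
revenue_mean = df_demo['Revenue'].mean()

# Fill only the 'Revenue' column; 'Quantity' keeps its missing value.
df_filled = df_demo.fillna({'Revenue': revenue_mean})
print(df_filled)
```

Because `fillna` returns a new DataFrame here, `df_demo` still contains the original NaN and can be compared against `df_filled` if needed.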


### Question 4: Removing Duplicates

#### Scenario:
Your DataFrame 'df_orders' contains duplicate entries for customer orders. Remove the duplicates based on all columns.

```python
# Question:
# Identify and remove duplicate rows from 'df_orders'.
# Display the DataFrame after removing duplicates.
```

```python
# Note: 'df_orders' is assumed to have been created earlier in the notebook.

# Identifying Duplicate Rows (boolean Series; True marks a repeated row)
duplicates = df_orders.duplicated()

# Removing Duplicate Rows
df_orders_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after Removing Duplicates
print("DataFrame after Removing Duplicates:")
print(df_orders_no_duplicates)
```

```
DataFrame after Removing Duplicates:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700
```
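
The cell above relies on `df_orders` already existing. To run it in isolation, a DataFrame along these lines could be built first. The first five rows mirror the output shown above; the sixth row is a hypothetical exact duplicate, added here only so that there is something to remove.

```python
import pandas as pd

# Rows 0-4 match the output above; row 5 is a made-up duplicate of row 1.
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
})

# duplicated() marks a row True when an identical row appeared earlier.
duplicates = df_orders.duplicated()
print("Number of duplicate rows:", duplicates.sum())

# Rows 0 and 2 share an OrderID but differ in Quantity, so both are kept.
df_orders_no_duplicates = df_orders.drop_duplicates()
print(df_orders_no_duplicates)
```

Rows 0 and 2 illustrate why the choice of columns matters: comparing on all columns keeps both, whereas `drop_duplicates(subset=['OrderID'])` would drop row 2.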


### Question 5: Conditional Indexing and Filtering

#### Scenario:
You want to analyze only the orders with a quantity greater than 2.

```python
# Question:
# Create a new DataFrame 'df_large_orders' containing only the orders with Quantity greater than 2.
# Display the new DataFrame.
```

```python
# Conditional Indexing and Filtering
df_large_orders = df_orders[df_orders['Quantity'] > 2]

# Displaying the DataFrame with Large Orders
print("DataFrame with Orders Quantity > 2:")
print(df_large_orders)
```

```
DataFrame with Orders Quantity > 2:
   OrderID Product  Quantity  Total_Price
3      103  Tablet         3          450
```