# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

### **`5. Basic DataFrame Operations`**


#### **`Descriptive Statistics in Pandas`**

**Introduction:**
Descriptive statistics aim to summarize and describe the main features of a dataset. Pandas provides various functions to compute descriptive statistics for each column in a DataFrame.

**1. Mean:**
   - **Definition:** The mean, also known as the average, is the sum of all values in a dataset divided by the number of observations.
   - **Pandas Code:**
     ```python
     mean_values = df.mean()
     ```
   - **Interpretation:** The mean provides a measure of central tendency, indicating the typical value in a dataset.
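
Real datasets often mix numeric and text columns. The sketch below assumes a hypothetical `df_mixed` with a `Name` column and uses the `numeric_only=True` argument so that only the numeric columns are averaged. In recent pandas releases, calling `mean()` on a DataFrame that contains text columns may raise an error without it.

```python
import pandas as pd

# Hypothetical mixed-type data (names and values are illustrative only)
df_mixed = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chen'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 75000],
})

# Restrict the calculation to numeric columns
numeric_means = df_mixed.mean(numeric_only=True)
print(numeric_means)
```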

**2. Median:**
   - **Definition:** The median is the middle value in a dataset when it is sorted in ascending order; with an even number of observations, it is the average of the two middle values. It is less sensitive to extreme values than the mean.
   - **Pandas Code:**
     ```python
     median_values = df.median()
     ```
   - **Interpretation:** The median gives insight into the central position of the data, especially in the presence of outliers.
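
To see why the median is preferred when outliers are present, here is a small sketch with made-up salary figures in which the last value is deliberately extreme. The mean is pulled far above most of the data, while the median stays at the middle observation.

```python
import pandas as pd

# Hypothetical salaries; the last value is a deliberate outlier
salaries = pd.Series([48000, 50000, 55000, 60000, 1000000])

mean_salary = salaries.mean()      # pulled upward by the outlier
median_salary = salaries.median()  # stays at the middle value

print("Mean:  ", mean_salary)    # 242600.0
print("Median:", median_salary)  # 55000.0
```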

**3. Mode:**
   - **Definition:** The mode represents the most frequently occurring value(s) in a dataset.
   - **Pandas Code:**
     ```python
     mode_values = df.mode().iloc[0]
     ```
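   - **Interpretation:** Identifying the mode helps in understanding the most common values in a dataset. Note that `df.mode()` returns every mode of each column, and `.iloc[0]` keeps only the first row, that is, the smallest mode per column. When all values in a column are unique, every value is a mode and the result is not very informative.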
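
A column can have more than one mode. The following sketch uses a hypothetical `df_modes` DataFrame to show what `df.mode()` returns in that case and what is lost when only the first row is kept.

```python
import pandas as pd

# Hypothetical data: column 'A' has two modes (1 and 2), column 'B' has one (5)
df_modes = pd.DataFrame({
    'A': [1, 1, 2, 2, 3],
    'B': [5, 5, 5, 6, 7],
})

all_modes = df_modes.mode()     # one row per mode; shorter columns are padded with NaN
first_mode = all_modes.iloc[0]  # keeps only the smallest mode of each column

print(all_modes)
print(first_mode)
```

If ties matter for your analysis, inspect `all_modes` rather than relying on the first row alone.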

**4. Standard Deviation:**
   - **Definition:** The standard deviation measures the amount of variation or dispersion in a set of values. A higher standard deviation indicates greater variability.
   - **Pandas Code:**
     ```python
     std_deviation = df.std()
     ```
   - **Interpretation:** Standard deviation is crucial for assessing the spread of values around the mean.

**5. Variance:**
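   - **Definition:** Variance is the average of the squared differences from the mean. It is the square of the standard deviation. By default, pandas computes the sample variance (and sample standard deviation), dividing the sum of squared differences by n - 1 rather than n.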
   - **Pandas Code:**
     ```python
     variance_values = df.var()
     ```
   - **Interpretation:** Variance provides another measure of data dispersion, useful in comparing the spread of different datasets.
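
The divisor used for the variance is controlled by the `ddof` argument of `var()` and `std()`. As a minimal sketch on a small illustrative series, the code below compares the default sample variance (`ddof=1`) with the population variance (`ddof=0`) and checks that the variance equals the squared standard deviation.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 22, 28])

sample_var = ages.var()            # default ddof=1: divide by n - 1
population_var = ages.var(ddof=0)  # divide by n
std_squared = ages.std() ** 2      # should match the sample variance

print("Sample variance:    ", sample_var)
print("Population variance:", population_var)
print("Std squared:        ", std_squared)
```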

**6. Quantiles and Percentiles:**
   - **Definition:** Quantiles divide a dataset into intervals with equal probabilities. Percentiles are specific quantiles expressed as percentages.
   - **Pandas Code:**
     ```python
     quantiles = df.quantile([0.25, 0.5, 0.75])
     ```
   - **Interpretation:** Quantiles help in understanding the distribution and identifying central points in the data.
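
A percentile is requested by passing the corresponding fraction to `quantile()`. The sketch below, on an illustrative series, computes the 90th percentile. By default pandas interpolates linearly between neighbouring observations; the `interpolation='lower'` option returns an actual observed value instead.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 30, 35])

# 90th percentile with the default linear interpolation
p90 = ages.quantile(0.90)

# 90th percentile restricted to an observed value at or below that position
p90_observed = ages.quantile(0.90, interpolation='lower')

print("90th percentile (interpolated):", p90)
print("90th percentile (observed):    ", p90_observed)
```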

**7. Interquartile Range (IQR):**
   - **Definition:** IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It provides a measure of statistical dispersion.
   - **Pandas Code:**
     ```python
     iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]
     ```
   - **Interpretation:** IQR is useful for identifying potential outliers and understanding the bulk of the data distribution.
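
One common way to use the IQR for outlier detection is the 1.5 × IQR rule: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. This rule is a widely used convention rather than something prescribed by pandas, and the salary figures below are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries with one unusually large value
salaries = pd.Series([48000, 50000, 55000, 60000, 75000, 200000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print(outliers)
```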

**8. Skewness:**
   - **Definition:** Skewness measures the asymmetry of a distribution. Positive skewness indicates a right-skewed distribution, while negative skewness indicates a left-skewed distribution.
   - **Pandas Code:**
     ```python
     skewness_values = df.skew()
     ```
   - **Interpretation:** Skewness provides insights into the shape of the distribution.
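
The sign of the skewness can be checked on small hand-made series. In this sketch (values chosen only for illustration), a symmetric series has skewness close to zero, a series with one large value is right-skewed, and its mirror image is left-skewed.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 50])  # long tail to the right
left_skewed = -right_skewed                  # mirror image: long tail to the left

print("Symmetric:   ", symmetric.skew())
print("Right-skewed:", right_skewed.skew())
print("Left-skewed: ", left_skewed.skew())
```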

**9. Kurtosis:**
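   - **Definition:** Kurtosis measures how heavy the tails of a distribution are compared with a normal distribution. High kurtosis indicates heavy tails and more extreme values, which often, but not always, goes together with a sharp peak. Pandas reports excess kurtosis (Fisher's definition), so a normal distribution has a value of 0.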
   - **Pandas Code:**
     ```python
     kurtosis_values = df.kurt()
     ```
   - **Interpretation:** Kurtosis helps in understanding the tails' thickness and the presence of outliers.
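
Because `kurt()` in pandas returns excess kurtosis (a normal distribution scores 0), the sign of the result is informative. The sketch below uses two made-up series: evenly spread values, which have light tails, and mostly identical values with one extreme observation, which have a heavy tail.

```python
import pandas as pd

# Evenly spread values: lighter tails than a normal distribution
light_tailed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Mostly identical values plus one extreme observation: heavy tail
heavy_tailed = pd.Series([5, 5, 5, 5, 5, 5, 5, 5, 5, 50])

print("Light-tailed:", light_tailed.kurt())  # negative excess kurtosis
print("Heavy-tailed:", heavy_tailed.kurt())  # positive excess kurtosis
```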

**10. Correlation and Covariance:**
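   - **Definition:** Correlation measures the strength and direction of the linear relationship between two variables on a scale from -1 to 1, while covariance measures their joint variability in the variables' original units. By default, `df.corr()` computes the Pearson correlation coefficient.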
   - **Pandas Code:**
     ```python
     correlation_matrix = df.corr()
     covariance_matrix = df.cov()
     ```
   - **Interpretation:** Correlation and covariance are crucial for understanding relationships between variables.
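
The two measures are closely linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this relationship on a small illustrative DataFrame, here named `df_pairs`.

```python
import pandas as pd

df_pairs = pd.DataFrame({
    'Age': [25, 30, 35, 22, 28],
    'Experience': [3, 5, 8, 2, 4],
})

cov_age_exp = df_pairs['Age'].cov(df_pairs['Experience'])
corr_age_exp = df_pairs['Age'].corr(df_pairs['Experience'])

# Rescale the covariance by both standard deviations
rescaled = cov_age_exp / (df_pairs['Age'].std() * df_pairs['Experience'].std())

print("Covariance:         ", cov_age_exp)
print("Correlation:        ", corr_age_exp)
print("Rescaled covariance:", rescaled)
```

Because correlation is unit-free, it is easier to compare across pairs of variables than covariance, whose size depends on the units of both columns.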

**Conclusion:**
Descriptive statistics in Pandas provide a comprehensive view of the distribution, relationships, and variability within a dataset. Understanding these measures is fundamental for data analysis and decision-making. The choice of which statistics to use depends on the nature of the data and the questions you want to answer.

#### Example:


```python
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Mean, Median, and Mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]

# Measures of Dispersion
std_deviation = df.std()
variance_values = df.var()

# Quantiles and Percentiles
quantiles = df.quantile([0.25, 0.5, 0.75])
iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]

# Summary Statistics
summary_stats = df.describe()

# Skewness and Kurtosis
skewness_values = df.skew()
kurtosis_values = df.kurt()

# Correlation and Covariance
correlation_matrix = df.corr()
covariance_matrix = df.cov()

# Displaying the results
print("Mean Values:\n", mean_values)
print("\nMedian Values:\n", median_values)
print("\nMode Values:\n", mode_values)
print("\nStandard Deviation:\n", std_deviation)
print("\nVariance Values:\n", variance_values)
print("\nQuantiles:\n", quantiles)
print("\nInterquartile Range (IQR):\n", iqr_values)
print("\nSummary Statistics:\n", summary_stats)
print("\nSkewness Values:\n", skewness_values)
print("\nKurtosis Values:\n", kurtosis_values)
print("\nCorrelation Matrix:\n", correlation_matrix)
print("\nCovariance Matrix:\n", covariance_matrix)
```


```text
Mean Values:
 Age              28.0
Salary        57600.0
Experience        4.4
dtype: float64

Median Values:
 Age              28.0
Salary        55000.0
Experience        4.0
dtype: float64

Mode Values:
 Age              22
Salary        48000
Experience        2
Name: 0, dtype: int64

Standard Deviation:
 Age               4.949747
Salary        10784.247772
Experience        2.302173
dtype: float64

Variance Values:
 Age                  24.5
Salary        116300000.0
Experience            5.3
dtype: float64

Quantiles:
        Age   Salary  Experience
0.25  25.0  50000.0         3.0
0.50  28.0  55000.0         4.0
0.75  30.0  60000.0         5.0

Interquartile Range (IQR):
 Age               5.0
Salary        10000.0
Experience        2.0
dtype: float64

Summary Statistics:
              Age        Salary  Experience
count   5.000000      5.000000    5.000000
mean   28.000000  57600.000000    4.400000
std     4.949747  10784.247772    2.302173
min    22.000000  48000.000000    2
```

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about the performance of students in an educational institution. The dataset includes student IDs, exam scores in different subjects, attendance percentages, and participation in extracurricular activities. You want to extract descriptive statistics to gain insights into the students' academic performance.

```python
import pandas as pd

# Sample student performance data
student_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, 90, 78, 92, 88],
    'English_Score': [75, 85, 80, 88, 92],
    'Attendance_Percentage': [92, 95, 88, 97, 93],
    'Extracurricular_Participation': [2, 3, 1, 4, 2],
}

# Creating a DataFrame from the student data
df_students = pd.DataFrame(student_data)

# Extracting Descriptive Statistics
mean_scores = df_students.mean()
median_scores = df_students.median()
std_deviation_scores = df_students.std()
attendance_summary = df_students['Attendance_Percentage'].describe()
correlation_matrix = df_students.corr()

# Displaying the Descriptive Statistics
print("Mean Scores:\n", mean_scores)
print("\nMedian Scores:\n", median_scores)
print("\nStandard Deviation of Scores:\n", std_deviation_scores)
print("\nAttendance Summary:\n", attendance_summary)
print("\nCorrelation Matrix:\n", correlation_matrix)
```


```text
Mean Scores:
 StudentID                         3.0
Math_Score                       86.6
English_Score                    84.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.4
dtype: float64

Median Scores:
 StudentID                         3.0
Math_Score                       88.0
English_Score                    85.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.0
dtype: float64

Standard Deviation of Scores:
 StudentID                        1.581139
Math_Score                       5.458938
English_Score                    6.670832
Attendance_Percentage            3.391165
Extracurricular_Participation    1.140175
dtype: float64

Attendance Summary:
 count     5.000000
mean     93.000000
std       3.391165
min      88.000000
25%      92.000000
50%      93.000000
75%      95.000000
max      97.000000
Name: Attendance_Percentage, dtype: float64
```

Correlation Matrix:
                                StudentID  Math_Score  English_