# Exploratory Data Analysis

### We will use `describe()`, `info()`, and `value_counts()`
### Explore summary statistics
### Analyze numerical & categorical data

#### Exploratory Data Analysis (EDA) is the initial step in data analysis: we explore the dataset to understand its structure, detect patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics, visualization, and grouping techniques.

**1. Reading the CSV**

In [2]:
import pandas as pd
df = pd.read_csv("students.csv")
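
The file `students.csv` is not included with this notebook. As a minimal sketch, assuming the file holds exactly the five rows displayed in the outputs of this notebook, the same DataFrame can be rebuilt from an in-memory string. The variable name `csv_text` is only illustrative.

```python
import io
import pandas as pd

# Assumed contents of students.csv, copied from the rows displayed in this notebook.
csv_text = """Name,Gender,Age,Course,Score,Passed
Ravi,Male,21,Python,88,Yes
Sneha,Female,22,Java,79,Yes
Arjun,Male,20,Python,62,No
Diya,Female,23,Data Science,91,Yes
Rishi,Male,24,Python,55,No
"""

# read_csv accepts any file-like object, so StringIO stands in for the file on disk.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```

This makes the rest of the notebook reproducible without the original file.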

**2. Looking at the Structure**

In [3]:
df.head()

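|   | Name  | Gender | Age | Course       | Score | Passed |
| - | ----- | ------ | --- | ------------ | ----- | ------ |
| 0 | Ravi  | Male   | 21  | Python       | 88    | Yes    |
| 1 | Sneha | Female | 22  | Java         | 79    | Yes    |
| 2 | Arjun | Male   | 20  | Python       | 62    | No     |
| 3 | Diya  | Female | 23  | Data Science | 91    | Yes    |
| 4 | Rishi | Male   | 24  | Python       | 55    | No     |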


In [4]:
df.tail()

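|   | Name  | Gender | Age | Course       | Score | Passed |
| - | ----- | ------ | --- | ------------ | ----- | ------ |
| 0 | Ravi  | Male   | 21  | Python       | 88    | Yes    |
| 1 | Sneha | Female | 22  | Java         | 79    | Yes    |
| 2 | Arjun | Male   | 20  | Python       | 62    | No     |
| 3 | Diya  | Female | 23  | Data Science | 91    | Yes    |
| 4 | Rishi | Male   | 24  | Python       | 55    | No     |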


In [6]:
df.shape

(5, 6)

**3. Dataset Overview**

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Gender  5 non-null      object
 2   Age     5 non-null      int64 
 3   Course  5 non-null      object
 4   Score   5 non-null      int64 
 5   Passed  5 non-null      object
dtypes: int64(2), object(4)
memory usage: 372.0+ bytes
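
`info()` prints its report rather than returning it, so the same facts (missing values, column types) are easier to reuse when computed directly. The sketch below assumes the five rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Assumed data: the five rows displayed by df.head() in this notebook.
df = pd.DataFrame({
    "Name": ["Ravi", "Sneha", "Arjun", "Diya", "Rishi"],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [21, 22, 20, 23, 24],
    "Course": ["Python", "Java", "Python", "Data Science", "Python"],
    "Score": [88, 79, 62, 91, 55],
    "Passed": ["Yes", "Yes", "No", "Yes", "No"],
})

# Count missing values per column (info() reports the complement: non-null counts).
missing_per_column = df.isna().sum()

# Split columns by kind so numerical and categorical analysis can be applied separately.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(numeric_cols, categorical_cols)
```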


**4. Descriptive Statistics**

In [9]:
df.describe()

|       | Age      | Score     |
| ----- | -------- | --------- |
| count | 5.0      | 5.0       |
| mean  | 22.0     | 75.0      |
| std   | 1.581139 | 15.890249 |
| min   | 20.0     | 55.0      |
| 25%   | 21.0     | 62.0      |
| 50%   | 22.0     | 79.0      |
| 75%   | 23.0     | 88.0      |
| max   | 24.0     | 91.0      |


| Metric        | Meaning                                          |
| ------------- | ------------------------------------------------ |
| count         | Number of non-null entries                       |
| mean          | Average value                                    |
| std           | Standard deviation                               |
| min/max       | Smallest and largest values                      |
| 25%, 50%, 75% | Percentile spread (useful for outlier detection) |
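
One common way to use the percentiles for outlier detection is the 1.5 × IQR rule. This rule is an assumption added here for illustration and is not something the notebook itself applies. The sketch uses the `Score` values shown earlier.

```python
import pandas as pd

# Assumed data: the Score column from the five rows displayed earlier.
df = pd.DataFrame({
    "Name": ["Ravi", "Sneha", "Arjun", "Diya", "Rishi"],
    "Score": [88, 79, 62, 91, 55],
})

q1 = df["Score"].quantile(0.25)   # same value as the 25% row of describe()
q3 = df["Score"].quantile(0.75)   # same value as the 75% row of describe()
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["Score"] < lower) | (df["Score"] > upper)]

print(lower, upper)
print(outliers)
```

With only five rows the fences are wide (23 to 127), so no score is flagged; the rule becomes more informative on larger datasets.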


**5. Analyzing Categorical Data**

In [None]:
df['Course'].value_counts() 
# Gives the frequency of each unique value

Course
Python          3
Java            1
Data Science    1
Name: count, dtype: int64

In [None]:
df['Gender'].value_counts(normalize=True) 
# In a gender equality study, this can reveal a skewed male:female ratio among applicants

Gender
Male      0.6
Female    0.4
Name: proportion, dtype: float64
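
To make the arithmetic behind `normalize=True` explicit: each proportion is the category count divided by the total number of rows (3/5 = 0.6 for Male, 2/5 = 0.4 for Female). A small sketch, assuming the `Gender` values shown earlier:

```python
import pandas as pd

# Assumed data: the Gender column from the five rows displayed earlier.
gender = pd.Series(["Male", "Female", "Male", "Female", "Male"], name="Gender")

counts = gender.value_counts()                      # raw frequencies
proportions = gender.value_counts(normalize=True)   # frequencies as fractions of the total

# Dividing the counts by their sum reproduces the normalized result.
manual = counts / counts.sum()

print(proportions)
print(manual)
```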

**6. Grouped Insights**

In [14]:
df.groupby('Course')['Score'].mean()
# Returns the average score for each course

Course
Data Science    91.000000
Java            79.000000
Python          68.333333
Name: Score, dtype: float64
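
A group mean says little without the group size: here the Data Science and Java averages each rest on a single student. As a hedged sketch, assuming the rows shown earlier, `agg` can return the count alongside the mean in one call.

```python
import pandas as pd

# Assumed data: Course and Score from the five rows displayed earlier.
df = pd.DataFrame({
    "Course": ["Python", "Java", "Python", "Data Science", "Python"],
    "Score": [88, 79, 62, 91, 55],
})

# Several aggregations at once: one column per statistic, one row per course.
summary = df.groupby("Course")["Score"].agg(["count", "mean", "max"])

# Rank courses by average score, highest first.
ranked = summary.sort_values("mean", ascending=False)

print(ranked)
```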

In [None]:
df.groupby('Course')['Score'].describe()

| Course       | count | mean      | std       | min  | 25%  | 50%  | 75%  | max  |
| ------------ | ----- | --------- | --------- | ---- | ---- | ---- | ---- | ---- |
| Data Science | 1.0   | 91.0      | NaN       | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 |
| Java         | 1.0   | 79.0      | NaN       | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 |
| Python       | 3.0   | 68.333333 | 17.387735 | 55.0 | 58.5 | 62.0 | 75.0 | 88.0 |
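
The `std` value is missing (NaN) for Data Science and Java because each of those groups has a single student. By default pandas computes the sample standard deviation (`ddof=1`), which divides by n − 1 and is therefore undefined for n = 1. A short sketch using scores from the table above:

```python
import math
import pandas as pd

# A group with one observation, like Data Science (score 91).
single = pd.Series([91])

sample_std = single.std()            # default ddof=1: divides by n - 1 = 0, giving NaN
population_std = single.std(ddof=0)  # ddof=0: divides by n, giving 0.0

# The Python group has three scores, so the sample std is defined.
python_std = pd.Series([88, 62, 55]).std()

print(sample_std, population_std, python_std)
```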


In [16]:
df.groupby('Course')['Passed'].value_counts()

Course        Passed
Data Science  Yes       1
Java          Yes       1
Python        No        2
              Yes       1
Name: count, dtype: int64
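
The counts above can also be turned into a pass rate per course. The following is one possible sketch, assuming the rows shown earlier: it converts `Passed` to booleans and averages them within each course, then uses `pd.crosstab` to lay the same counts out as a table.

```python
import pandas as pd

# Assumed data: Course and Passed from the five rows displayed earlier.
df = pd.DataFrame({
    "Course": ["Python", "Java", "Python", "Data Science", "Python"],
    "Passed": ["Yes", "Yes", "No", "Yes", "No"],
})

# The mean of a boolean Series is the fraction of True values, i.e. the pass rate.
pass_rate = df["Passed"].eq("Yes").groupby(df["Course"]).mean()

# Same counts as groupby + value_counts, arranged with one column per outcome.
table = pd.crosstab(df["Course"], df["Passed"])

print(pass_rate)
print(table)
```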