# Statistics:
 The branch of mathematics that deals with collecting, organizing, analyzing, and interpreting data for decision-making.

# Types

* Descriptive Statistics

* Inferential Statistics

* Probability Distribution

# Descriptive Statistics

A branch of statistics that summarizes and describes the main features of a dataset using measures like mean, median, mode, variance, and visualizations (charts, graphs, tables) without making predictions or inferences.

### Import the Required Libraries

In [34]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load the Dataset

In [35]:
df=pd.read_csv(r"C:\Users\User-PC\Downloads\Statistics-20250831T034916Z-1-001\Statistics\student_data.csv")
df.head()

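| Unnamed: 0 | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |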


# 1) Measures of Central Tendency

Measures of central tendency summarize a large dataset with a single representative value, giving a quick picture of where the data is centered before it is used to train a model.

Techniques: mean, median, mode.

### Mean

The average value. It is sensitive to outliers.

In [36]:
mn=np.mean(df['age'])
print(mn)

16.696202531645568


### Median
The middle value in an ordered numeric sequence.

In [37]:
md=np.median(df['age'])
print(md)

17.0
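
The difference between the mean and the median is easiest to see with an outlier. The sketch below uses a small, made-up list of ages (not taken from `student_data.csv`) and appends one extreme value: the mean shifts noticeably while the median barely moves.

```python
import numpy as np

# Hypothetical ages, invented for illustration only
ages = np.array([15, 16, 16, 17, 18])
ages_with_outlier = np.append(ages, 60)  # add one extreme value

mean_before = np.mean(ages)
mean_after = np.mean(ages_with_outlier)
median_before = np.median(ages)
median_after = np.median(ages_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```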


### Mode
The most frequent value in the dataset.

In [38]:
mo=df['age'].mode()[0]
print(mo)

16


### Visualization

In [40]:
sns.histplot(x='age',data=df,bins=np.arange(0,22),color='green')
# Vertical reference lines at the mean, median and mode
plt.axvline(mn, c='red',label='mean')
plt.axvline(md, c='blue',label='median')
plt.axvline(mo, c='yellow',label='mode')
plt.legend()
plt.show()

# 2) Measures of Variability

A measure of variability (or dispersion) describes how spread out or scattered the data values are around the center (mean/median). It shows the degree to which data points differ from each other.

## Common Measures of Variability

### Range
Difference between maximum and minimum values.


In [None]:
min_r=df['G1'].min()
max_r=df['G1'].max()

# Named g1_range so that the built-in range() is not shadowed
g1_range=max_r - min_r
g1_range

In [None]:
class_1=np.array([75,65,73,68,72,76])
class_2=np.array([90,47,43,96,93,51])
no = [1,2,3,4,5,6]

### Standard Deviation
Standard deviation measures the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wide range.

Example:

In [None]:
np.std(class_1),np.std(class_2)

### Variance
Average of squared differences from the mean.


In [None]:
np.var(class_1), np.var(class_2)
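
By default, `np.std` and `np.var` divide by N (the population formulas). When the data is a sample from a larger population, the versions that divide by N - 1 are often preferred; in NumPy this is the `ddof=1` argument. A minimal sketch using the same `class_1` values:

```python
import numpy as np

class_1 = np.array([75, 65, 73, 68, 72, 76])

pop_var = np.var(class_1)             # divides by N
sample_var = np.var(class_1, ddof=1)  # divides by N - 1
pop_std = np.std(class_1)             # square root of the population variance

print(pop_var, sample_var, pop_std)
```

Whichever convention is chosen, it should be applied consistently when comparing datasets.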

### Mean Absolute Deviation

![image.png](attachment:058e1298-244f-4470-bbb9-f35304901e20.png)

The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives an idea of the dispersion in the dataset.

Example:

In [None]:
mean=np.mean(class_1)
mean

In [None]:
# Each class is measured against its own mean
class_1_mad=np.mean(np.abs(class_1-np.mean(class_1)))
class_2_mad=np.mean(np.abs(class_2-np.mean(class_2)))

In [None]:
class_1_mad,class_2_mad
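
The calculation can also be wrapped in a small helper so that each dataset is always compared against its own mean. The name `mean_absolute_deviation` is an illustrative choice here, not a NumPy function.

```python
import numpy as np

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean of `values`."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

class_1 = np.array([75, 65, 73, 68, 72, 76])
class_2 = np.array([90, 47, 43, 96, 93, 51])

mad_1 = mean_absolute_deviation(class_1)
mad_2 = mean_absolute_deviation(class_2)
print(mad_1, mad_2)
```

Both classes have similar means, but the second class has a much larger mean absolute deviation, which matches its wider spread of marks.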

### Visualization

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(class_1,no,label='Class 1')
plt.scatter(class_2,no, color='red',label='Class 2')
plt.legend()
plt.show()

# Percentiles & Quartiles 

### Percentiles:

Percentiles divide ordered data into 100 equal parts. The k-th percentile is the value below which k% of the data falls. 

Example: If a student’s test score is at the 90th percentile, they scored better than 90% of students.



In [None]:
q1 = np.percentile(df['age'], 25)
q2 = np.percentile(df['age'], 50)  
q3 = np.percentile(df['age'], 75)

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", q3-q1)

### Quartiles

* Quartiles divide ordered data into 4 equal parts (25% each).

* Q1 (25th percentile): 25% of data is below this value.

* Q2 (50th percentile / Median): Middle of the data.

* Q3 (75th percentile): 75% of data is below this value.

* IQR (Interquartile Range): Q3 − Q1 → measures spread of the middle 50% (helps detect outliers).
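
One common convention for using the IQR to flag outliers is the 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are treated as potential outliers. The sketch below applies it to a made-up list of absence counts; the multiplier 1.5 is a convention, not something derived from the data.

```python
import numpy as np

# Hypothetical absence counts, invented for illustration
absences = np.array([0, 2, 4, 4, 6, 6, 8, 10, 40])

q1, q3 = np.percentile(absences, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences
outliers = absences[(absences < lower_fence) | (absences > upper_fence)]
print("IQR:", iqr, "fences:", (lower_fence, upper_fence), "outliers:", outliers)
```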

In [None]:
df.describe()

In [None]:
sns.boxplot(x='absences',data=df)
plt.show()

# Measures of Shape

### 1.) Skewness

![image.png](attachment:93340bbb-9528-4704-9abe-68575fda4d34.png)

It tells whether the distribution is symmetrical or tilted to one side. There are two types of skewness.

* Positive skewness: tail on the right (Mode < Median < Mean)
* Negative skewness: tail on the left (Mean < Median < Mode)

![1_nj-Ch3AUFmkd0JUSOW_bTQ.jpg](attachment:9de5a575-7899-49a2-8ec1-f088fe7928ee.jpg)
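
The link between the sign of the skewness and the order of mean and median can be checked on synthetic data. This sketch assumes an exponential sample as the right-skewed example and its mirror image as the left-skewed one; the seed and sample size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

right_skewed = pd.Series(rng.exponential(scale=1.0, size=1000))  # long right tail
left_skewed = -right_skewed                                      # mirror image: long left tail

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "skew:", s.skew(), "mean:", s.mean(), "median:", s.median())
```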

### Positive Skew

In [None]:
df['age'].skew()

In [None]:
sns.displot(df['age'])
plt.show()

### Negative Skew

In [None]:
# Negated exponential draws have a long left tail, so the sample skewness is negative
data= -np.random.exponential(100,100)
df_2=pd.DataFrame({"x":data})
df_2['x'].skew()

In [None]:
sns.histplot(df_2['x'])
plt.show()

In [None]:
df_2['x'].mean(),df_2['x'].median()

### Symmetric Distribution

In [None]:
data_2=[2,3,3,4,4,5,5,5,5,6,6,6,6,6,7,7,7,7,7,8,8,8,8,8,9,9,9,9,10,10,11,11,12]
df_3=pd.DataFrame({"x":data_2})
df_3['x'].skew()

In [None]:
sns.histplot(x='x',data=df_3,bins=[2,3,4,5,6,7,8,9,10,11,12,13])
plt.show()

In [None]:
print(df_3['x'].mean(), df_3['x'].median(), df_3['x'].mode())

# Probability

Probability measures the likelihood of a particular outcome or event occurring. It is expressed as a number between 0 and 1, where 0 indicates impossibility (the event will not occur) and 1 indicates certainty (the event will occur).

* P(A) = Number of outcomes in which A occurs / Total number of possible outcomes (assuming all outcomes are equally likely)
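
For equally likely outcomes, this formula reduces to counting. A minimal sketch for a fair six-sided die, where the event A is "the roll is even" (the die is an illustrative choice, not part of the dataset):

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]                         # all outcomes of one roll
event_a = [o for o in sample_space if o % 2 == 0]         # outcomes where A occurs

p_a = Fraction(len(event_a), len(sample_space))
print(p_a, float(p_a))
```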

## Random Variable

A random variable is a variable whose possible values are outcomes of a random experiment. It assigns a numeric value to each outcome.

### Types of Random Variable
#### 1.) Discrete Random Variable:

Takes countable values (finite or countably infinite).

Example: Number of students in a class, number of heads in 10 coin tosses.

#### 2.) Continuous Random Variable:

Takes any value within a range (infinite, uncountable).

Example: Height of students, time taken to complete a task, temperature.
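
The two kinds of random variable can be simulated with NumPy's random generator. In this sketch the height model (normal with mean 170 cm and standard deviation 10 cm) is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Discrete: number of heads in 10 fair coin tosses (an integer from 0 to 10)
heads = rng.binomial(n=10, p=0.5, size=5)

# Continuous: heights in cm, drawn from an assumed normal(170, 10) model
heights = rng.normal(loc=170, scale=10, size=5)

print(heads)
print(heights)
```

The first array contains only whole numbers, while the second can take any real value within its range.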

# Probability Distribution 

A probability distribution describes how the probabilities of different outcomes are distributed over the sample space of a random variable.

* Discrete Probability Distribution
* Continuous Probability Distribution

![image.png](attachment:c8bb4c18-2398-492c-978f-a77829729506.png)

![image.png](attachment:03e8557b-8797-443b-9409-4f7e349ea624.png)



## Probability Distribution Function

It is a mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment.

* Probability Density Function (PDF)
* Probability Mass Function (PMF)
* Cumulative Distribution Function (CDF)

![image.png](attachment:eeb483a1-50fd-4402-bcf8-641ad76ab5d3.png)
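
To make the three functions concrete, the sketch below writes them out with only the standard library: a PMF for a discrete variable (binomial), and a PDF and CDF for a continuous one (normal). The helper names are illustrative; in practice a library such as SciPy provides equivalents.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

pmf_total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(binomial_pmf(5, 10, 0.5))          # probability of exactly 5 heads in 10 tosses
print(pmf_total)                         # PMF values sum to 1 (up to rounding)
print(normal_pdf(0.0), normal_cdf(0.0))  # density and cumulative probability at the mean
```

A PMF returns an actual probability for each value, whereas a PDF returns a density: probabilities for a continuous variable come from the area under the curve, which is what the CDF accumulates.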

## Normal Distribution

Also known as the Gaussian distribution, it is symmetric about the mean: values near the mean occur more frequently than values far from the mean.

* Formula:
![image.png](attachment:dc402a6d-31a1-4e0b-896d-9d2bd196c1e1.png)

![image.png](attachment:6ef10a30-0fba-449b-998f-55f221ce58d3.png)

* In graph form, the normal distribution appears as a bell curve.
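
How strongly normal data concentrates around the mean can be checked by simulation. The sketch below draws from a normal distribution with an assumed mean of 50 and standard deviation of 5, then measures the share of values within one and two standard deviations of the mean (roughly 68% and 95% for a normal distribution).

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=50, scale=5, size=100_000)  # assumed mean 50, std 5

within_1_sd = np.mean(np.abs(samples - 50) <= 5)   # share within one standard deviation
within_2_sd = np.mean(np.abs(samples - 50) <= 10)  # share within two standard deviations
print(within_1_sd, within_2_sd)
```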

## Standard Normal Distribution

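* The standard normal distribution, also known as the Z-distribution, is a special case of the normal distribution.
* It has a mean (μ) of 0 and a standard deviation (σ) of 1. A z-score expresses a value as its distance from the mean, measured in standard deviations.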
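
Any dataset can be rescaled to mean 0 and standard deviation 1 by computing z = (x − μ) / σ for each value. A minimal sketch, using the `class_1` marks from the variability section as input:

```python
import numpy as np

scores = np.array([75, 65, 73, 68, 72, 76], dtype=float)  # class_1 values

# Subtract the mean and divide by the (population) standard deviation
z_scores = (scores - scores.mean()) / scores.std()

print(z_scores)
print(z_scores.mean(), z_scores.std())  # approximately 0 and 1
```

Standardizing changes the scale of the data, not its shape: the result is only normally distributed if the original data was.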

## Covariance & Correlation

* Covariance indicates the direction of the linear relationship between two variables, that is, whether the variables tend to move in the same direction or in opposite directions.

  ![image.png](attachment:2ab4b94b-d95e-4390-b2d6-f8f991693b28.png)
* Increasing the value of one variable might have a positive or negative impact on the value of the other variable.

* x increases, y increases -> Positive Covariance/Correlation
* x increases, y decreases (or x decreases, y increases) -> Negative Covariance/Correlation
* No linear relationship between x and y -> Covariance/Correlation close to 0
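
The sign of the covariance can be checked with `np.cov`, which returns a 2 × 2 matrix whose off-diagonal entry is the covariance of the two inputs. The values below are made up so that one pair rises together and the other moves in opposite directions.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_up = 2 * x + 1      # rises with x
y_down = 10 - x       # falls as x rises

cov_up = np.cov(x, y_up)[0, 1]      # off-diagonal entry: covariance of x and y_up
cov_down = np.cov(x, y_down)[0, 1]  # covariance of x and y_down
print(cov_up, cov_down)
```

Only the sign is easy to interpret here; the magnitude depends on the units of the variables, which is why correlation is used to judge strength.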

### Correlation
* Correlation analysis is a method of statistical evaluation used to study the strength of a relationship between two, numerically measured, continuous variables.
  ![image.png](attachment:e87544f1-95cb-4d4d-aa73-6c9ad98373ad.png)

* where Cov is the covariance of x and y

* σx is the standard deviation of x

* σy is the standard deviation of y

  ![image.png](attachment:adec4cd8-398b-4a2d-99c7-b3e3eb88a36e.png)
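
The Pearson correlation coefficient can be computed directly as the covariance divided by the product of the two standard deviations, and compared against `np.corrcoef`. The data is made up for illustration, and the sample convention (`ddof=1`) is used throughout so that the terms are consistent.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (divides by N - 1)
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Unlike covariance, the result is unitless and always lies between -1 and 1.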

### Correlation Graph:

![image.png](attachment:c2cf1cc4-432f-4444-b5d0-f32d5203baa3.png)

## Pearson Correlation Graph

![image.png](attachment:a6e64a6c-7716-4c7d-b6c3-c2f3ac0f74ca.png)
![image.png](attachment:a5fb0902-bd60-44f6-a7ec-97c5b7566188.png)
* Notice that the points become more scattered as the correlation coefficient approaches 0 from either the positive or the negative side.

  ![image.png](attachment:354788d8-db16-4da3-a91e-08eb7ab2cb6c.png)

### Example

In [None]:
df.head(3)

In [None]:
# 'number' selects every numeric column regardless of integer width
data_corr=df.select_dtypes(include='number').corr()
data_corr

In [None]:
plt.figure(figsize=(14,8))
sns.heatmap(data_corr, annot=True, cmap='mako')
plt.show()

In [None]:
# 'number' selects every numeric column regardless of integer width
data_cov=df.select_dtypes(include='number').cov()

In [None]:
plt.figure(figsize=(14,8))
sns.heatmap(data_cov, annot=True)
plt.show()