# Data Analysis and Statistical Calculations: Fundamental Concepts and Practical Applications

Statistical calculations are among the basic tools of data analysis. They help you understand the properties of a data set, recognize patterns, and interpret results.

In this article, we examine statistical calculations, one of the cornerstones of the data analysis process. You will learn how to perform basic statistical calculations with the Pandas library and how to use them to extract meaningful information from your data sets.

## Data Analysis and Statistical Calculations

Data analysis starts with understanding the statistical properties of a data set and interpreting the data through them. The following Pandas methods cover the basic statistical calculations used most often during data analysis:

#### 1. describe()
#### 2. mean()
#### 3. median()
#### 4. mode()
#### 5. sum()
#### 6. min()
#### 7. max()
#### 8. std()
#### 10. var()
#### 11. count()
#### 12. quantile()
#### 13. corr()

## Example

Below is an example showing how to use the statistical calculations listed above.

```python
import pandas as pd

# Create a sample data frame
data = {
    'Student': ['John', 'Jane', 'Michael', 'David', 'Evelyn', 'Mustafa', 'Emily', 'John', 'Denise', 'Grace',
                'Ibrahim', 'Eliza', 'Bruce', 'Fatima', 'Emir', 'Catherine', 'Rose', 'Kadir', 'Irene', 'Oscar'],
    'Course': ['Mathematics', 'Physics', 'Chemistry', 'Mathematics', 'Physics', 'Chemistry', 'Mathematics', 'Physics', 'Chemistry',
               'Mathematics', 'Physics', 'Chemistry', 'Mathematics', 'Physics', 'Chemistry', 'Mathematics', 'Physics', 'Chemistry',
               'Mathematics', 'Physics'],
    'Grade': [90, 85, 95, 75, 80, 92, 87, 88, 93, 82, 79, 91, 86, 89, 78, 83, 94, 81, 84, 90],
    'StudyHours': [10, 12, 15, 9, 8, 16, 14, 11, 17, 9, 10, 13, 12, 15, 8, 11, 14, 9, 10, 16]
}

df = pd.DataFrame(data)

# Display the first 5 rows of the data frame
print(df.head())
```

```
   Student       Course  Grade  StudyHours
0     John  Mathematics     90          10
1     Jane      Physics     85          12
2  Michael    Chemistry     95          15
3    David  Mathematics     75           9
4   Evelyn      Physics     80           8
```


## 1. describe():

This function returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of the data set.

```python
summary = df.describe()
summary
```

|       | Grade    | StudyHours |
|-------|----------|------------|
| count | 20.0     | 20.0       |
| mean  | 86.1     | 11.95      |
| std   | 5.739063 | 2.874113   |
| min   | 75.0     | 8.0        |
| 25%   | 81.75    | 9.75       |
| 50%   | 86.5     | 11.5       |
| 75%   | 90.25    | 14.25      |
| max   | 95.0     | 17.0       |
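
By default, `describe()` on a data frame covers only the numeric columns. The following is a minimal sketch, using a five-row subset of the sample data as an assumption, of how a text column can be summarized and how the `percentiles` argument selects which percentile rows are reported.

```python
import pandas as pd

# Illustrative five-row subset of the article's data
df = pd.DataFrame({
    "Course": ["Mathematics", "Physics", "Chemistry", "Mathematics", "Physics"],
    "Grade": [90, 85, 95, 75, 80],
})

# For a text column, describe() reports count, unique, top, and freq
course_summary = df["Course"].describe()
print(course_summary)

# The percentiles argument selects which percentile rows are reported
grade_summary = df["Grade"].describe(percentiles=[0.1, 0.9])
print(grade_summary)
```

Here `unique` is the number of distinct courses and `freq` is how often the most common one appears.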


## 2. mean():

Calculates the average of numeric values in a column.

```python
# Average of the Grade column
mean = df['Grade'].mean()
print(f"Average of numerical values in the column: {mean}")
```

```
Average of numerical values in the column: 86.1
```
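
The same method also works on a whole data frame. As a rough sketch on a five-row subset of the sample data (the subset is an assumption made to keep the example short), passing `numeric_only=True` restricts the calculation to numeric columns, so text columns such as `Student` are skipped.

```python
import pandas as pd

# Illustrative five-row subset of the article's data
df = pd.DataFrame({
    "Student": ["John", "Jane", "Michael", "David", "Evelyn"],
    "Grade": [90, 85, 95, 75, 80],
    "StudyHours": [10, 12, 15, 9, 8],
})

# Column-wise means; text columns are left out of the result
column_means = df.mean(numeric_only=True)
print(column_means)
```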


## 3. median():

Calculates the median value of numeric values in a column.

```python
# Median of the Grade column
median = df['Grade'].median()
print(f"Median of numeric values in a column: {median}")
```

```
Median of numeric values in a column: 86.5
```
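
One reason to look at the median alongside the mean is that it reacts less to extreme values. The sketch below uses made-up grades, with a single 0 standing in for a missed exam, to show the difference.

```python
import pandas as pd

# Hypothetical grades; the second series swaps one grade for an outlier of 0
grades = pd.Series([75, 80, 85, 90, 95])
with_outlier = pd.Series([75, 80, 85, 90, 0])

# Without the outlier, mean and median agree
print(grades.mean(), grades.median())

# With the outlier, the mean drops sharply while the median moves only slightly
mean_with_outlier = with_outlier.mean()
median_with_outlier = with_outlier.median()
print(mean_with_outlier, median_with_outlier)
```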


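## 4. mode():

Finds the mode (the most frequent value) of a column. Because several values can tie for most frequent, `mode()` returns a Series rather than a single number; the example below takes its first element.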

```python
# Mode of the Grade column
mode = df['Grade'].mode()
print(f"Mode of numeric values in column: {mode.values[0]}")
```

```
Mode of numeric values in column: 90
```


```python
# Count how often each grade occurs to confirm the mode
grade_counts = df['Grade'].value_counts()
print(grade_counts)
```

```
90    2
79    1
81    1
94    1
83    1
78    1
89    1
86    1
91    1
82    1
85    1
93    1
88    1
87    1
92    1
80    1
75    1
95    1
84    1
Name: Grade, dtype: int64
```
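
In the sample data only one grade repeats, so the mode is unambiguous. A small sketch with made-up values shows what happens when two values tie, and that `mode()` is not limited to numbers.

```python
import pandas as pd

# Hypothetical grades in which 85 and 90 both occur twice
grades = pd.Series([90, 85, 90, 85, 70])

# mode() returns every tied value, sorted in ascending order
modes = grades.mode()
print(modes.tolist())  # [85, 90]

# mode() also works on text columns
courses = pd.Series(["Physics", "Chemistry", "Physics"])
most_common_course = courses.mode().iloc[0]
print(most_common_course)  # Physics
```

Taking only the first element, as in the example above, silently drops the other tied values, so it is worth checking the length of the result first.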


## 5. sum():

Calculates the sum of numeric values in a column.

```python
# Total of the Grade column
total = df['Grade'].sum()
print(f"Sum of numeric values in the column: {total}")
```

```
Sum of numeric values in the column: 1722
```


## 6. min():

Finds the smallest numerical value in a column.

```python
# Minimum value of the Grade column
min_value = df['Grade'].min()
print(f"The smallest of the numeric values in the column: {min_value}")
```

```
The smallest of the numeric values in the column: 75
```


## 7. max():

Finds the largest numerical value in a column.

```python
# Maximum value of the Grade column
max_value = df['Grade'].max()
print(f"The largest of the numerical values in the column: {max_value}")
```

```
The largest of the numerical values in the column: 95
```
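
`min()` and `max()` return the values themselves, not the rows they come from. One way to find out which student the extreme grade belongs to is `idxmin()` / `idxmax()`, which return the index label of the extreme value. The sketch assumes the first five rows of the sample data.

```python
import pandas as pd

# First five rows of the article's data
df = pd.DataFrame({
    "Student": ["John", "Jane", "Michael", "David", "Evelyn"],
    "Grade": [90, 85, 95, 75, 80],
})

# idxmax()/idxmin() give the index label of the largest/smallest grade
top_row = df.loc[df["Grade"].idxmax()]
bottom_row = df.loc[df["Grade"].idxmin()]

print(top_row["Student"], top_row["Grade"])        # Michael 95
print(bottom_row["Student"], bottom_row["Grade"])  # David 75
```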


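## 8. std():

Calculates the standard deviation of the numeric values in a column. By default, Pandas computes the sample standard deviation (`ddof=1`), which divides by n - 1 rather than n.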

```python
# Standard deviation of the Grade column
std = df['Grade'].std()
print(f"Standard deviation of numerical values in the column: {std}")
```

```
Standard deviation of numerical values in the column: 5.739062824648564
```
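
Pandas uses `ddof=1` for `std()`, which gives the sample standard deviation, while NumPy's `std` defaults to the population form. A minimal sketch on five of the sample grades (the subset is an assumption for brevity) compares the two through the `ddof` argument.

```python
import pandas as pd

# Five grades taken from the article's data
grades = pd.Series([90, 85, 95, 75, 80])

# Default: sample standard deviation, dividing by n - 1
sample_std = grades.std()

# ddof=0: population standard deviation, dividing by n
population_std = grades.std(ddof=0)

print(f"sample: {sample_std:.4f}, population: {population_std:.4f}")
```

The squared deviations from the mean of 85 add up to 250, so the two results are the square roots of 250 / 4 and 250 / 5 respectively.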


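## 10. var():

Calculates the variance of the numeric values in a column. As with `std()`, the default is the sample variance (`ddof=1`), and the variance is the square of the standard deviation.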

```python
# Variance of the Grade column
var = df['Grade'].var()
print(f"Variance of numeric values in column: {var}")
```

```
Variance of numeric values in column: 32.93684210526315
```
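
To see where this number comes from, the sample variance can be computed by hand as the sum of squared deviations from the mean divided by n - 1. The sketch below repeats the article's 20 grades and checks the manual result against `var()` and against the square of `std()`.

```python
import pandas as pd

# The 20 grades from the article's data frame
grades = pd.Series([90, 85, 95, 75, 80, 92, 87, 88, 93, 82,
                    79, 91, 86, 89, 78, 83, 94, 81, 84, 90])

n = len(grades)
mean = grades.mean()

# Sum of squared deviations from the mean, divided by n - 1
manual_var = ((grades - mean) ** 2).sum() / (n - 1)

print(manual_var)
print(grades.var())       # same value, up to floating-point rounding
print(grades.std() ** 2)  # variance is the squared standard deviation
```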


## 11. count():

Counts the non-null values in a column. Missing values (`NaN`) are excluded, so the result can be smaller than the number of rows.

```python
# Total number of values in the Grade column
count = df['Grade'].count()
print(f"Total number in column: {count}")
```

```
Total number in column: 20
```
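
The sample data has no missing values, so `count()` matches the number of rows. The following sketch uses a made-up series with two missing grades to show how `count()` differs from `len()`.

```python
import pandas as pd

# Hypothetical grades with two missing entries
grades = pd.Series([90, float("nan"), 85, float("nan"), 75])

non_null = grades.count()            # counts only non-missing values
total_rows = len(grades)             # counts every row, missing or not
missing = int(grades.isna().sum())   # number of missing values

print(non_null, total_rows, missing)  # 3 5 2
```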


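## 12. quantile():

Calculates the requested quantiles (percentiles) of the numeric values in a column. Quantiles are passed as fractions between 0 and 1, so 0.25 is the 25th percentile.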

```python
# Specific percentiles of numerical values in the Grade column
quantiles = df['Grade'].quantile([0.25, 0.50, 0.75])
print("Percentiles:")
print(quantiles)
```

```
Percentiles:
0.25    81.75
0.50    86.50
0.75    90.25
Name: Grade, dtype: float64
```
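
The value 81.75 is not an actual grade: by default `quantile()` interpolates linearly between the two nearest data points. As a sketch, the code below repeats the article's grades, derives the interquartile range (a common spread measure that the article itself does not use), and shows the `interpolation` argument returning an observed value instead.

```python
import pandas as pd

# The 20 grades from the article's data frame
grades = pd.Series([90, 85, 95, 75, 80, 92, 87, 88, 93, 82,
                    79, 91, 86, 89, 78, 83, 94, 81, 84, 90])

q1 = grades.quantile(0.25)  # linear interpolation between 81 and 82
q3 = grades.quantile(0.75)  # linear interpolation between 90 and 91

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1

# interpolation="lower" picks the lower of the two neighboring grades
q1_lower = grades.quantile(0.25, interpolation="lower")

print(q1, q3, iqr, q1_lower)
```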


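## 13. corr():

Calculates the correlation between columns. Called on one Series with another, as in the example below, it returns a single coefficient (Pearson by default); called on a DataFrame, it returns the pairwise correlation matrix.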

```python
# Calculate correlation between Grade and StudyHours
correlation = df['Grade'].corr(df['StudyHours'])
print(f"Correlation (Grade and StudyHours): {correlation}")
```

```
Correlation (Grade and StudyHours): 0.83312167291602
```
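
For more than two columns, it is usually more convenient to compute the full matrix at once. A minimal sketch, repeating the two numeric columns from the article, selects them explicitly and calls `corr()` on the resulting data frame.

```python
import pandas as pd

# The two numeric columns from the article's data frame
df = pd.DataFrame({
    "Grade": [90, 85, 95, 75, 80, 92, 87, 88, 93, 82,
              79, 91, 86, 89, 78, 83, 94, 81, 84, 90],
    "StudyHours": [10, 12, 15, 9, 8, 16, 14, 11, 17, 9,
                   10, 13, 12, 15, 8, 11, 14, 9, 10, 16],
})

# Pairwise Pearson correlations between the selected columns
corr_matrix = df[["Grade", "StudyHours"]].corr()
print(corr_matrix)

# A single coefficient can be read from the matrix by label
grade_vs_hours = corr_matrix.loc["Grade", "StudyHours"]
print(f"{grade_vs_hours:.4f}")
```

The diagonal is always 1, since each column correlates perfectly with itself, and the matrix is symmetric. A coefficient near 0.83 indicates a strong positive linear relationship in this sample, though correlation alone does not establish that more study hours cause higher grades.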
