# Summer of Code - Artificial Intelligence
## Week 02: Descriptive Statistics and Probability
### Day 04: Measure of Central Tendency

In this notebook, we will learn about **Data, its types, and Measures of Central Tendency** using Python.

# What is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting data.
- By collection we mean gathering data from various sources.
- By organization we mean arranging the data in a systematic way.
- By analysis we mean examining the data to find patterns and relationships.
- By interpretation we mean making sense of the data and drawing conclusions.

## Descriptive Statistics
Descriptive statistics deals with summarizing and describing the main features of a dataset. It provides simple summaries of the sample and its measurements, presenting quantitative descriptions in a manageable form.

# Data
Data is a collection of information that can be analyzed and used to make decisions.
- Data can be in the form of numbers, text, images, audio, video, etc.
- Data can be structured (organized in a specific format) or unstructured (not organized in a specific format).

## Types of Data
1. **Quantitative Data**: Numerical data that can be measured or counted.
   - Examples: Height, weight, age, income, temperature.
2. **Qualitative Data**: Categorical data that describes characteristics or attributes.
   - Examples: Gender, race, marital status, education level, eye color.

### Quantitative Data
Quantitative data can be further classified into two types:
1. **Discrete Data**: Data that can take on only specific values (usually whole numbers).
   - Examples: Number of students in a class, number of cars in a parking lot.
2. **Continuous Data**: Data that can take on any value within a range (including fractions and decimals).
    - Examples: Height, weight, temperature, time.

### Qualitative Data
Qualitative data can be further classified into two types:
1. **Nominal Data**: Data that can be categorized but not ordered.
   - Examples: Gender, race, marital status, eye color.
2. **Ordinal Data**: Data that can be categorized and ordered.
   - Examples: Education level (high school, bachelor's, master's, doctorate), customer satisfaction (very unsatisfied, unsatisfied, neutral, satisfied, very satisfied).
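The difference between nominal and ordinal data can be sketched in code. The example below is only an illustration: the `education_order` list and the `responses` sample are made-up values, and the numeric ranks encode order only, not distance between levels.

```python
# Hypothetical ordering for an ordinal variable (illustrative, not from a real survey).
education_order = ["high school", "bachelor's", "master's", "doctorate"]

# Map each level to its position so that levels can be compared.
rank = {level: position for position, level in enumerate(education_order)}

# Made-up sample of responses.
responses = ["master's", "high school", "doctorate", "bachelor's", "high school"]

# Sorting by rank is meaningful for ordinal data; for nominal data
# (for example eye color) no such ordering exists.
sorted_responses = sorted(responses, key=lambda level: rank[level])
print(sorted_responses)
```

Sorting alphabetically would put "bachelor's" before "doctorate" before "high school", which is why an explicit order is supplied here.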

In [2]:
# Categorical/Qualitative
fruits = ["Apple", "Banana", "Cherry", "Apple", "Date", "Banana", "Banana", "Grape"]

Apple = 0
Banana = 0
Cherry = 0
Date = 0
Grape = 0

for fruit in fruits:
    if fruit == "Apple":
        Apple += 1
    elif fruit == "Banana":
        Banana += 1
    elif fruit == "Cherry":
        Cherry += 1
    elif fruit == "Date":
        Date += 1
    elif fruit == "Grape":
        Grape += 1

print(Apple)
print(Banana)
print(Cherry)
print(Date)
print(Grape)

2
3
1
1
1


In [3]:
fruit_counts = {
    "Apple": 0,
    "Banana": 0,
    "Cherry": 0,
    "Date": 0,
    "Grape": 0,
}

for fruit in fruits:
    fruit_counts[fruit] += 1

fruit_counts

{'Apple': 2, 'Banana': 3, 'Cherry': 1, 'Date': 1, 'Grape': 1}
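The dictionary-based tally above needs every key to be listed in advance. As a possible alternative, the standard library's `collections.Counter` builds the same counts without that step. This is a minimal sketch that reuses the `fruits` list from the earlier cell.

```python
from collections import Counter

fruits = ["Apple", "Banana", "Cherry", "Apple", "Date", "Banana", "Banana", "Grape"]

# Counter creates a key for each distinct value the first time it is seen.
fruit_counts = Counter(fruits)

print(fruit_counts["Banana"])  # count for one category
print(dict(fruit_counts))      # all counts as a plain dictionary
```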

In [4]:
# columns: name, subject, marks
row1 = ["Ahmad", "Physics", 90]
row2 = ["Ali", "Chemistry", 90]

In [8]:
table = [
    ["Ahmad", "Physics", 90],
    ["Ali", "Chemistry", 90],
    ["Ali", "Chemistry", 90]
    ]

table[0]

['Ahmad', 'Physics', 90]

In [9]:
table[0][0]

'Ahmad'
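Since `table[i][j]` selects row `i` and column `j`, a whole column can be collected by taking the same position from every row. The sketch below assumes the three-row `table` from the cell above; the names `names`, `marks`, and `average_marks` are introduced here only for illustration.

```python
table = [
    ["Ahmad", "Physics", 90],
    ["Ali", "Chemistry", 90],
    ["Ali", "Chemistry", 90],
]

# Column 0 holds the names (qualitative), column 2 holds the marks (quantitative).
names = [row[0] for row in table]
marks = [row[2] for row in table]

# Arithmetic only makes sense on the quantitative column.
average_marks = sum(marks) / len(marks)

print(names)
print(marks)
print(average_marks)
```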

# Measures of Central Tendency
Measures of central tendency are statistical measures that describe the center or average of a dataset. The three most common measures of central tendency are:
1. **Mean**: The mean is the sum of all the values in a dataset divided by the number of values. It is also known as the average, calculated as:
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$


In [20]:
numerical_data = [12, 19, 30, 25, 21, 10, 200]

summ = 0
for number in numerical_data:
    summ += number
    
average = summ / len(numerical_data)
average

45.285714285714285

In [21]:
sum(numerical_data) / len(numerical_data)

45.285714285714285


In the formula for the mean, $x_i$ represents each value in the dataset and $n$ is the total number of values.
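The standard library's `statistics.mean` computes the same quantity. The sketch below also shows how strongly a single extreme value (200 in this dataset) pulls the mean; the variable names are chosen here for illustration.

```python
import statistics

numerical_data = [12, 19, 30, 25, 21, 10, 200]

# Same result as sum(numerical_data) / len(numerical_data).
mean_with_outlier = statistics.mean(numerical_data)

# Drop the extreme value to see how much it influences the mean.
data_without_outlier = [x for x in numerical_data if x != 200]
mean_without_outlier = statistics.mean(data_without_outlier)

print(mean_with_outlier)
print(mean_without_outlier)
```

With 200 included, the mean is about 45.3, which is larger than six of the seven values. Without it, the mean is 19.5.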

2. **Median**: The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.


In [None]:
numerical_data = [12, 19, 30, 25, 21, 10, 200]

numerical_data.sort()  # sorts in place, modifying the original list
numerical_data

[10, 12, 19, 21, 25, 30, 200]

In [22]:
numerical_data = [12, 19, 30, 25, 21, 10, 200]

numerical_data_sorted = sorted(numerical_data)

numerical_data_sorted

[10, 12, 19, 21, 25, 30, 200]

In [17]:
numerical_data  # unchanged: sorted() returns a new list

[12, 19, 30, 25, 21, 10, 200]

In [23]:
n = len(numerical_data_sorted)
mid = n // 2  # Python lists are zero-indexed

if n % 2 == 0:
    # even count: average the two middle values
    median = (numerical_data_sorted[mid - 1] + numerical_data_sorted[mid]) / 2
else:
    # odd count: take the single middle value
    median = numerical_data_sorted[mid]

print(median)

21
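As a cross-check, the median can be wrapped in a small function and compared with `statistics.median` from the standard library. This is a sketch: `median_of`, `odd_data`, and `even_data` are names introduced here, and the even-length list is simply the same data without 200. Because list indices start at 0, the middle of a sorted list of length `n` sits at index `n // 2` when `n` is odd, and at indices `n // 2 - 1` and `n // 2` when `n` is even.

```python
import statistics


def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # sorted() leaves the input list unchanged
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # even count: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # odd count: the single middle value
    return ordered[mid]


odd_data = [12, 19, 30, 25, 21, 10, 200]
even_data = [12, 19, 30, 25, 21, 10]

print(median_of(odd_data), statistics.median(odd_data))
print(median_of(even_data), statistics.median(even_data))
```

For the seven-value dataset the median is 21, far below the mean of about 45.3, because the median is not pulled by the single large value 200.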


3. **Mode**: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all.
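One way to find the mode is to count every value and keep those with the highest count. The sketch below uses a hypothetical helper named `modes` together with `statistics.multimode` (available from Python 3.8). Both return a list so that datasets with several modes are handled. Note that when every value appears exactly once, both return all the values, whereas many textbooks describe that case as having no mode.

```python
from collections import Counter
import statistics


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


fruits = ["Apple", "Banana", "Cherry", "Apple", "Date", "Banana", "Banana", "Grape"]
two_modes = [1, 1, 2, 2, 3]

print(modes(fruits))                    # one mode
print(modes(two_modes))                 # more than one mode
print(statistics.multimode(two_modes))  # standard-library equivalent
```

The mode is the only one of the three measures that also applies to qualitative data such as the fruit names.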