# Project 1: Analysis of Student Count Distribution Across School Years





## Look at the data
I first open `dataset.csv`. I check which columns are numeric and select `Count of Students` as the target for the rest of the notebook.

In [None]:
# open dataset.csv and select the "Count of Students" column
import pandas as pd

df = pd.read_csv("data/dataset.csv")

data_select = df["Count of Students"]
data_select 


0            22.0
1          2017.0
2          9016.0
3         29113.0
4           429.0
           ...   
302083        NaN
302084        NaN
302085        NaN
302086        NaN
302087        NaN
Name: Count of Students, Length: 302088, dtype: float64

### Which columns are numeric?
I ask pandas which columns are numeric, then lock in the target column. I convert it to numbers, drop missing values, and keep only non-negative values so the results are clean.
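One way to ask pandas for the numeric columns is `select_dtypes`. The sketch below uses a tiny made-up frame (`demo` is hypothetical; only the column names echo this notebook), so it runs without the real file.

```python
import pandas as pd

# Made-up rows for illustration; not taken from dataset.csv
demo = pd.DataFrame({
    "Year": ["2010-2011", "2011-2012"],
    "Count of Students": [22.0, 2017.0],
})

# select_dtypes(include="number") keeps only integer and float columns
numeric_columns = demo.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

On the real data, the same call on `df` should list `Count of Students` among the numeric columns, assuming pandas parsed it as a float column as the output above suggests.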

In [11]:
# Clean the data
data_select = pd.to_numeric(data_select, errors='coerce')  # non-numeric entries become NaN
data_select = data_select.dropna()                         # drop missing and coerced values
data_select = data_select[data_select >= 0]                # keep non-negative counts only

# Look at the data after cleaning
data_select.describe()

count    60655.000000
mean       122.194279
std        835.881400
min         10.000000
25%         13.000000
50%         21.000000
75%         49.000000
max      39376.000000
Name: Count of Students, dtype: float64

## Pandas route: mean, median, and mode
This is the straightforward path. I use the built-in methods.

- **Mean**: sum of all values divided by the number of values.
- **Median**: the middle value after sorting; for an even count, average the two middle values.
- **Mode**: the most frequent value(s). There can be more than one.
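To make the three definitions concrete, here is a small sketch on made-up numbers (`sample` is hypothetical, not from the dataset). It has an even count, so the median averages the two middle values, and it has two modes.

```python
import pandas as pd

# Six made-up values, already sorted for readability
sample = pd.Series([10, 10, 13, 21, 21, 49])

sample_mean = sample.mean()             # 124 / 6
sample_median = sample.median()         # average of 13 and 21
sample_modes = sample.mode().tolist()   # every value tied for most frequent

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Modes:", sample_modes)
```

Because `mode()` returns all tied values in sorted order, taking `.iloc[0]` keeps only the smallest one.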

In [14]:
# Quick summary with pandas

mean_value = data_select.mean()
median_value = data_select.median()
mode_value = data_select.mode().iloc[0]   # mode() returns a Series; take the first value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)


Mean: 122.19427911961091
Median: 21.0
Mode: 10.0


## Standard-library route (no pandas methods, no `statistics`)
Now I re-do the same work without the pandas statistics methods. I turn the cleaned column into a plain Python list and implement the three statistics from scratch. This makes the recipes explicit and shows what the library functions are doing under the hood. The list still comes from the pandas-cleaned column; a fully pandas-free version would read the file with `csv.DictReader` instead.

In [None]:
# First, turn the cleaned data into a list
values = list(data_select)

# ---- Mean ----
total = 0
for x in values:
    total = total + x
mean_value_py = total / len(values)

# ---- Median ----
values_sorted = sorted(values)
length = len(values_sorted)

if length % 2 == 1:
    # If the number of values is odd, take the middle one
    median_value_py = values_sorted[length // 2]
else:
    # If even, take the average of the two middle numbers
    mid1 = values_sorted[length // 2 - 1]
    mid2 = values_sorted[length // 2]
    median_value_py = (mid1 + mid2) / 2

# ---- Mode ----
counts = {}
for x in values_sorted:
    if x not in counts:
        counts[x] = 1
    else:
        counts[x] = counts[x] + 1

# The mode is the value with the highest count
mode_value_py = max(counts, key=counts.get)

print("Mean (Python):", mean_value_py)
print("Median (Python):", median_value_py)
print("Mode (Python):", mode_value_py)

Here I calculated the mean, median, and mode using only basic Python:

To find the mean, I added all the numbers together and divided by how many numbers there are.

To find the median, I sorted the list and looked for the value in the middle.

To find the mode, I counted how many times each number appears and picked the one with the highest count.
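For a version that does not touch pandas at all, the column could be read with `csv.DictReader`. This is only a sketch: the rows in `raw` are made up and stand in for the real file, and `values_csv` is a hypothetical name.

```python
import csv
import io

# Stand-in for open("data/dataset.csv"); the rows are made up
raw = io.StringIO(
    "Year,Count of Students\n"
    "2010-2011,22\n"
    "2010-2011,\n"
    "2011-2012,2017\n"
    "2011-2012,s\n"
)

values_csv = []
for row in csv.DictReader(raw):
    text = row["Count of Students"].strip()
    try:
        number = float(text)
    except ValueError:
        continue  # skip blanks and non-numeric entries
    if number >= 0:  # also rejects NaN, since NaN >= 0 is False
        values_csv.append(number)

print(values_csv)
```

The resulting list can be passed to the mean, median, and mode code above in place of `values`.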

## Histogram
I create a simple text histogram with 10 equal-width bins, one row of stars per bin. It is minimal but useful. It shows where values are concentrated and whether the distribution looks symmetric or skewed.

In [22]:
# Turn the cleaned data into a list
values = list(data_select)

# Find the minimum and maximum values
min_values = min(values)
max_values = max(values)

# Divide the data range into 10 bins 
bins = 10
step = (max_values - min_values) / bins

# Count how many values fall into each bin
histogram_dataset = [0] * bins

for v in values:
    index = int((v - min_values) / step)
    if index == bins:   # avoid going out of range
        index = bins - 1
    histogram_dataset[index] = histogram_dataset[index] + 1

# Print a simple text histogram
print("Text Histogram for Count of Students:\n")
for i in range(bins):
    bar = "***" * (histogram_dataset[i] // 10)  
    print(f"Bin {i+1:02d}: {bar}")


Text Histogram for Count of Students:

Bin 01: *****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************

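In the printed histogram, almost all of the bar length is in the first bin.

This means most rows have relatively low numbers in the "Count of Students" column, and only very few rows have very high counts.

In other words, the distribution is right-skewed: many low values and fewer high values. The summary statistics agree: the mean (about 122) is far above the median (21), which happens when a few large values pull the mean upward.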
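With data this skewed, printing a fixed number of stars per 10 values gives one very long bar. A possible alternative is to scale every bar to a maximum width and print the count next to it. The function below is a sketch, not the notebook's method; `text_histogram`, `width`, and `demo_values` are names I made up, and the demo values are invented.

```python
def text_histogram(values, bins=10, width=50):
    """Return per-bin counts and text rows scaled to at most `width` stars."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1  # fall back to 1 if all values are equal
    counts = [0] * bins
    for v in values:
        # the maximum value would land at index == bins, so clamp it
        index = min(int((v - lo) / step), bins - 1)
        counts[index] += 1

    peak = max(counts)
    rows = []
    for i, count in enumerate(counts):
        # ceiling division, so any non-empty bin gets at least one star
        stars = -(-count * width // peak)
        rows.append(f"Bin {i + 1:02d}: {'*' * stars} ({count})")
    return counts, rows


# Made-up, right-skewed values
demo_values = [10, 11, 12, 13, 15, 20, 25, 40, 90, 1000]
demo_counts, demo_rows = text_histogram(demo_values)
print("\n".join(demo_rows))
```

The widest bar is always `width` characters, and the printed counts keep the information that the scaling hides.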

## Year Summary Table
To show how the values change by year, I grouped the data by year, summed the Count of Students, and displayed the result as a table.

In [None]:

year_summary = df.groupby("Year")["Count of Students"].sum()

# Turn it into a table for display
year_table = year_summary.reset_index()
year_table.columns = ["Year", "Total Count of Students"]
year_table



| | Year | Total Count of Students |
|---|---|---|
| 0 | 2010-2011 | 693136.0 |
| 1 | 2011-2012 | 394215.0 |
| 2 | 2012-2013 | 358407.0 |
| 3 | 2013-2014 | 598588.0 |
| 4 | 2014-2015 | 587910.0 |
| 5 | 2015-2016 | 621062.0 |
| 6 | 2016-2017 | 587175.0 |
| 7 | 2017-2018 | 609724.0 |
| 8 | 2018-2019 | 613852.0 |
| 9 | 2019-2020 | 654263.0 |


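From this table we can see that the totals are fairly stable from 2013-2014 through 2019-2020, staying between about 587,000 and 654,000. The early years are different: the total is highest in 2010-2011 (about 693,000) and then drops sharply to about 394,000 in 2011-2012 and 358,000 in 2012-2013.
A drop of that size is not a small fluctuation. It may reflect missing or differently reported rows in those two years rather than a real change in the student population, but this notebook does not check that.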
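The yearly totals can also be drawn as a one-line text sparkline. The sketch below copies the totals from the table above; `sparkline` is a helper name I made up, and the scaling (lowest total maps to the shortest block, highest to the tallest) is one choice among several.

```python
def sparkline(values):
    """Map each value to one of eight block characters by relative height."""
    blocks = "▁▂▃▄▅▆▇█"
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return blocks[0] * len(values)  # flat line when all values are equal
    return "".join(
        blocks[int((v - lo) / span * (len(blocks) - 1))] for v in values
    )


# Totals copied from the year summary table, 2010-2011 through 2019-2020
yearly_totals = [
    693136, 394215, 358407, 598588, 587910,
    621062, 587175, 609724, 613852, 654263,
]
spark = sparkline(yearly_totals)
print("2010-2011", spark, "2019-2020")
```

Because the scale starts at the lowest total rather than at zero, the sparkline exaggerates differences between years; it is meant for spotting the shape, not for reading off values.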

## Conclusion

In this analysis, I cleaned the dataset to remove missing and invalid values, and then focused on the column Count of Students. I calculated the mean, median, and mode using both pandas and basic Python code to show how these statistics work. I also created a simple text histogram to show how the data is distributed and a table to compare totals across school years.

From the results, we can see that most values in Count of Students are relatively small, which means the distribution is heavily concentrated at the lower end and right-skewed. When the data is summarized by year, the totals are fairly stable from 2013-2014 onward, but 2011-2012 and 2012-2013 are much lower than the other years. Overall, this suggests that the student population is roughly consistent in the later years, while the early dip needs a closer look at how those years were reported before drawing conclusions from it.