
# PROGRAMMING LANGUAGE TRENDS — STACK OVERFLOW TAGS OVER TIME

This notebook analyzes **Stack Overflow** tag usage over time to approximate the relative popularity of major programming languages.
We clean the CSV export, reshape it into one time series per language, plot the raw monthly counts, and smooth each series with a rolling mean.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.float_format = '{:,.0f}'.format


## Get the Data

Use the provided `QueryResults.csv` or export your own from Stack Exchange Data Explorer (same schema):  
- Query link: https://data.stackexchange.com/stackoverflow/query/675441/popular-programming-languages-per-over-time-eversql-com

Expected CSV schema (header row present):
- `DATE` — month (e.g., 2008-09-01 00:00:00)  
- `TAG` — programming language tag (e.g., `python`, `java`)  
- `POSTS` — number of posts with that tag in the month
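
As a quick check of the schema above, the sketch below parses a few rows in the same `DATE`, `TAG`, `POSTS` layout from an in-memory string. The row values and the names `sample_csv`, `sample_df`, and `schema_ok` are made up for illustration and are not taken from the real export.

```python
import io

import pandas as pd

# Hypothetical sample rows in the DATE, TAG, POSTS layout described above.
sample_csv = io.StringIO(
    "DATE,TAG,POSTS\n"
    "2008-09-01 00:00:00,python,10\n"
    "2008-09-01 00:00:00,java,20\n"
    "2008-10-01 00:00:00,python,15\n"
)

# parse_dates converts the DATE column to datetime64 while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["DATE"])

# Confirm the columns arrive in the expected order before doing anything else.
expected_columns = ["DATE", "TAG", "POSTS"]
schema_ok = list(sample_df.columns) == expected_columns

print(sample_df.dtypes)
print("Schema matches:", schema_ok)
```

Running the same column check against your own export is a cheap way to catch a renamed or reordered column early.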


## Load & Inspect Data

In [None]:
# Read the CSV. header=0 treats the first row as a header, and names=[...] replaces those header labels.
# If your export lists its columns in a different order, adjust 'names' accordingly.
df = pd.read_csv('QueryResults.csv', header=0, names=['DATE', 'TAG', 'POSTS'])
print("Shape:", df.shape)
print(df.head())
print(df.tail())

print("\nNon-null values per column:")
print(df.count())

## Clean Dates & Types

In [None]:
# Convert DATE to datetime and POSTS to numeric (defensive conversion)
df['DATE'] = pd.to_datetime(df['DATE'], errors='coerce')
df['POSTS'] = pd.to_numeric(df['POSTS'], errors='coerce')

# Drop rows with missing key fields
df = df.dropna(subset=['DATE', 'TAG', 'POSTS']).copy()

print("Date range:", df['DATE'].min(), "→", df['DATE'].max())
print("Unique tags:", df['TAG'].nunique())

# Months of data per language
months_per_tag = df.groupby('TAG')['POSTS'].count().sort_values()
months_per_tag.head(), months_per_tag.tail()

## Reshape to Time Series by Language

In [None]:
reshaped_df = df.pivot(index='DATE', columns='TAG', values='POSTS')
print("Reshaped shape:", reshaped_df.shape)
print(reshaped_df.head())
print(reshaped_df.tail())

print("\nEntries per language after pivot (non-null counts):")
print(reshaped_df.count())

# Fill missing months with 0 posts for consistency
reshaped_df = reshaped_df.fillna(0).sort_index()
reshaped_df.head()
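
`DataFrame.pivot` raises a `ValueError` when the same (`DATE`, `TAG`) pair appears more than once. If your export contains such duplicates, one possible workaround is `pivot_table` with an aggregation function. The sketch below uses made-up data and assumes that summing duplicate rows is the right way to combine them, which may not hold for every export.

```python
import pandas as pd

# Hypothetical long-format data with one duplicated (DATE, TAG) pair.
long_df = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01"]),
    "TAG": ["python", "python", "java", "java"],
    "POSTS": [5, 7, 3, 4],
})

# pivot_table aggregates duplicates; aggfunc="sum" adds their post counts.
# Assumption: duplicates are partial counts that should be added together.
wide_df = long_df.pivot_table(index="DATE", columns="TAG", values="POSTS", aggfunc="sum")

# Months with no rows for a tag come out as NaN; treat them as zero posts.
wide_df = wide_df.fillna(0).sort_index()

print(wide_df)
```

If the duplicates are instead repeated copies of the same row, dropping them with `drop_duplicates` before pivoting would be the more appropriate fix.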

## Quick Plot: Single Language

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(reshaped_df.index, reshaped_df['java'])
plt.title('Monthly Posts: java')
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.show()

## Compare Two Languages

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(reshaped_df.index, reshaped_df['java'], linewidth=2, label='java')
plt.plot(reshaped_df.index, reshaped_df['python'], linewidth=2, label='python')
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.title('Monthly Posts: Java vs Python')
plt.legend()
plt.show()

## All Languages (Raw Monthly Posts)

In [None]:
plt.figure(figsize=(16, 10))
for column in reshaped_df.columns:
    plt.plot(reshaped_df.index, reshaped_df[column], linewidth=2, label=column)

plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.title('Stack Overflow Monthly Posts by Language Tag')
plt.legend(fontsize=9, ncol=2)
plt.show()


## Smoothing with Rolling Mean

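Monthly counts are noisy. We smooth each series with a **6-month rolling mean** so that medium-term trends are easier to see. With the default settings, the first five months have no smoothed value because their window is incomplete.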
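
To see what a rolling mean does at the start of a series, here is a small sketch on a made-up series of seven values. The names `toy`, `strict`, and `lenient` are illustrative only. The `min_periods=1` variant is an optional alternative, not something the analysis in this notebook relies on.

```python
import pandas as pd

# Hypothetical monthly counts, chosen so the averages are easy to verify by hand.
toy = pd.Series([10, 20, 30, 40, 50, 60, 70], dtype="float64")

# Default behaviour: a value is produced only once 6 observations are available,
# so the first 5 entries are NaN.
strict = toy.rolling(window=6).mean()

# min_periods=1 averages whatever is available, so early entries are not NaN,
# but they are based on fewer than 6 months.
lenient = toy.rolling(window=6, min_periods=1).mean()

print(pd.DataFrame({"raw": toy, "strict": strict, "lenient": lenient}))
```

The strict version leaves a gap at the left edge of the plot, while the lenient version fills it at the cost of less smoothing in the earliest months.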


In [None]:
roll_df = reshaped_df.rolling(window=6).mean()

plt.figure(figsize=(16, 10))
for column in roll_df.columns:
    plt.plot(roll_df.index, roll_df[column], linewidth=2, label=column)

plt.xlabel('Date')
plt.ylabel('Number of Posts (6-mo avg)')
plt.title('Stack Overflow Language Trends — 6‑Month Rolling Mean')
plt.legend(fontsize=9, ncol=2)
plt.show()

## Totals & Rankings

In [None]:
# Total posts per language across the entire period
totals = reshaped_df.sum().sort_values(ascending=False)
print("Top languages by total posts:")
print(totals.head(10))
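
All-time totals favour languages that were popular early on. As a rough complement, the sketch below ranks languages by their posts in the most recent 12 rows only. It uses made-up data with a monthly index, and the 12-month cutoff is an arbitrary assumption rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical wide table: 24 monthly rows, one column per language tag.
idx = pd.date_range("2019-01-01", periods=24, freq="MS")
counts = pd.DataFrame(
    {
        "java": [100] * 12 + [50] * 12,    # strong early, weaker later
        "python": [40] * 12 + [90] * 12,   # weak early, stronger later
    },
    index=idx,
)

# Ranking over the whole period.
overall = counts.sum().sort_values(ascending=False)

# Ranking over the last 12 rows (months) only.
recent = counts.tail(12).sum().sort_values(ascending=False)

print("Overall leader:", overall.index[0])
print("Recent leader:", recent.index[0])
```

On the real data, the same idea would be `reshaped_df.tail(12).sum()`, assuming the index is sorted by date with one row per month.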


## Conclusion

- The **pivoted time series** makes it easy to compare languages over time.
- **Rolling averages** reveal medium-term trends that raw monthly series can hide.
- Totals across the full period highlight the **most discussed languages** overall, but rankings over recent months may differ.
- For a fairer comparison across time, you might normalize by **total Stack Overflow activity** per month or compute **share of posts** per language.
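
The share-of-posts idea in the last bullet can be sketched as follows. The data is made up, and the denominator is the sum over the tracked language tags only, since the CSV described in this notebook does not contain total Stack Overflow activity. The names `counts`, `monthly_total`, and `share` are illustrative.

```python
import pandas as pd

# Hypothetical wide table: rows are months, columns are language tags.
counts = pd.DataFrame(
    {"java": [30, 40, 0], "python": [10, 60, 0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
)

# Total posts across the tracked tags in each month.
monthly_total = counts.sum(axis=1)

# Divide each row by its monthly total. Months with zero posts are masked to NaN
# before dividing, then filled with 0 so they do not produce division errors.
share = counts.div(monthly_total.where(monthly_total > 0), axis=0).fillna(0)

print(share)
```

Each non-empty row of `share` sums to 1, so plotting it shows how the mix of languages shifts even when overall site activity rises or falls.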
