In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the required datasets for the analysis
batting_file = './baseball/core/Batting.csv'
salaries_file = './baseball/core/Salaries.csv'
batting_post_file = './baseball/core/BattingPost.csv'
awards_players_file = './baseball/core/AwardsPlayers.csv'
pitching_file = './baseball/core/Pitching.csv'
people_file = './baseball/core/People.csv'

# Load the CSV files into pandas dataframes
batting_data = pd.read_csv(batting_file)
salaries_data = pd.read_csv(salaries_file)
batting_post_data = pd.read_csv(batting_post_file)
awards_players_data = pd.read_csv(awards_players_file)
pitching_data = pd.read_csv(pitching_file)
people_data = pd.read_csv(people_file)

**Analysis 1:** Player Performance over Time (Batting Statistics)

In [None]:
# Aggregate batting statistics (HR and AVG) by year
# Batting average per row; rows with no at-bats give NaN (0/0), which mean() skips
batting_data['AVG'] = batting_data['H'] / batting_data['AB']
batting_performance = batting_data.groupby('yearID').agg({'HR': 'sum', 'AVG': 'mean'}).reset_index()

fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot Total Home Runs on the primary y-axis
ax1.set_xlabel('Year')
ax1.set_ylabel('Total Home Runs', color='blue')
ax1.plot(batting_performance['yearID'], batting_performance['HR'], color='blue', label='Total Home Runs')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a secondary y-axis for the Batting Average
ax2 = ax1.twinx()
ax2.set_ylabel('Batting Average', color='green')
ax2.plot(batting_performance['yearID'], batting_performance['AVG'], color='green', label='Batting Average')
ax2.tick_params(axis='y', labelcolor='green')

# Set the title and display the plot
plt.title('Player Performance Over Time (Home Runs and Batting Average)')
fig.tight_layout()
plt.show()

This graph plots two key batting metrics over time: total home runs and the mean batting average. Here's how to interpret it:

**Blue Line: Total Home Runs (Left Y-Axis):**
The blue line shows the total number of home runs hit by players in each year.
The left y-axis corresponds to the number of home runs.
We observe that the total number of home runs increases over time, indicating that more home runs are hit in recent years. This could be due to changes in player strength, training, or strategies favoring power hitting. Because this is a raw total, part of the rise also reflects the larger number of teams and games in later seasons.

**Green Line: Average Batting Average (Right Y-Axis):**
The green line represents the mean of the individual batting averages of all players for each year, so every player counts equally regardless of how many at-bats they had.
The right y-axis corresponds to the batting average, which always lies between 0 and 1.
Unlike the steady rise in home runs, the batting average shows less variation and remains relatively stable over time. The batting average has stayed in a consistent range between 0.178 and 0.255, with slight fluctuations. This suggests that while players may be hitting more home runs, their ability to hit overall (as measured by batting average) has not dramatically changed.

**Key Observations:**

*   **Increasing Home Runs:** The upward trend in home runs highlights a significant change in player performance or strategies, especially in the modern era.
*   **Stable Batting Average:** Batting average does not show the same dramatic rise, indicating that while power hitting has increased, the overall ability to hit has remained relatively steady.

This graph illustrates how the nature of offensive performance in baseball has evolved, with a growing emphasis on hitting home runs, while general hitting consistency (batting average) has not changed as much.
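
The yearly figure plotted above is the mean of each row's individual average, so a player with a handful of at-bats weighs as much as an everyday starter. A minimal sketch of an at-bat-weighted alternative (total hits divided by total at-bats per year) is shown below for comparison. The `toy_batting` frame holds made-up numbers and `league_avg_by_year` is a hypothetical helper, neither of which appears in the original notebook.

```python
import pandas as pd

def league_avg_by_year(batting: pd.DataFrame) -> pd.DataFrame:
    """Return the unweighted and the at-bat-weighted batting average per year."""
    # Drop rows without at-bats so the per-row average is always defined
    with_ab = batting[batting['AB'] > 0].copy()
    with_ab['AVG'] = with_ab['H'] / with_ab['AB']

    yearly = with_ab.groupby('yearID').agg(
        H=('H', 'sum'),
        AB=('AB', 'sum'),
        unweighted_avg=('AVG', 'mean'),  # every row counts equally
    ).reset_index()

    # Weighted version: all hits in the year divided by all at-bats in the year
    yearly['weighted_avg'] = yearly['H'] / yearly['AB']
    return yearly[['yearID', 'unweighted_avg', 'weighted_avg']]

# Made-up data for illustration only (hypothetical, not from Batting.csv)
toy_batting = pd.DataFrame({
    'yearID': [2000, 2000, 2000, 2001, 2001],
    'H':      [150, 0, 0, 100, 50],
    'AB':     [500, 4, 0, 400, 200],
})

result = league_avg_by_year(toy_batting)
print(result)
```

In the toy data, the hitless player with four at-bats pulls the unweighted average for 2000 far below the weighted one, which is the effect to watch for when reading the green line.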

**Analysis 2:** Comparing Salaries and Age

In [None]:
# Calculate the correlation between player age and salary

# Merge the salary data with each player's birth year from the People table
salary_age_data = pd.merge(salaries_data, people_data[['playerID', 'birthYear']], on='playerID', how='left')
salary_age_data['age'] = salary_age_data['yearID'] - salary_age_data['birthYear']

# Keep realistic ages only; rows with a missing age fail both comparisons and are dropped too
salary_age_data = salary_age_data[(salary_age_data['age'] > 15) & (salary_age_data['age'] < 60)]  # Age filter

# Calculate the correlation between age and salary
correlation = salary_age_data['age'].corr(salary_age_data['salary'])

# Plot salary vs age
plt.figure(figsize=(8, 6))
plt.scatter(salary_age_data['age'], salary_age_data['salary'], alpha=0.5, color='blue')
plt.title(f'Salary vs Age (Correlation: {correlation:.2f})')
plt.xlabel('Age')
plt.ylabel('Salary ($)')
plt.grid(True)
plt.show()

correlation

This scatter plot shows the relationship between a baseball player's age (on the x-axis) and their salary (on the y-axis). Each point represents an individual player’s salary at a specific age.

**Key Observations:**
1. **Correlation (0.33):**
*   The correlation value of 0.33 indicates a positive but moderate relationship between age and salary.
*   This means that, generally, as players get older, their salaries increase, but the relationship is not very strong. Other factors beyond age (such as performance, experience, team contracts, etc.) also play a significant role in determining salaries.

2. **Peak Salary Age:**
*   The graph shows that salaries peak around the ages of 30 to 35. This is the period when most players are at their prime in terms of experience, physical condition, and performance.
*   Players tend to reach their highest earning potential around this age range due to long-term contracts, endorsements, and their established value within teams.

3. **Salary Decline After Age 35:**
*   After the age of 35, the number of players earning high salaries begins to decline. This is expected, as players usually start to decline in performance and physical capability in their later years.
*   Fewer players in the 40+ age group command the same high salaries, although some high-performing veterans might still be exceptions.

4. **Low Salaries for Younger Players:**
*   Players below the age of 25 generally have lower salaries, as they are often newer to the league and still proving themselves. Entry-level contracts in a player's first major league seasons also contribute to this trend.
*   As players gain experience and prove their worth, their salaries tend to rise, as shown by the upward trend into the 30–35 age range.

5. **Outliers:**
*   There are a few players in their late 20s and early 30s earning extremely high salaries (above $30 million), which are outliers. These could be star players with exceptional contracts.

**Key Insights:**
* Salaries increase with age up to a certain point (around 30–35 years), which is typically when players reach their peak performance.
* After this peak, salaries decline as players age, aligning with the natural decline in physical ability and performance.
* The positive correlation (0.33) suggests that age is one of several factors influencing player salaries, but it is not the only determinant.
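
A single correlation coefficient cannot capture the rise-then-fall shape described above, so one way to check it is to summarise salary by age directly. The sketch below uses the median, which is less sensitive to a few very large contracts than the mean. The `toy_salaries` values are made up and `median_salary_by_age` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

def median_salary_by_age(salary_age: pd.DataFrame) -> pd.DataFrame:
    """Summarise salaries per age with the median and the number of salary rows."""
    return (
        salary_age.groupby('age')['salary']
        .agg(median_salary='median', players='count')
        .reset_index()
    )

# Made-up data for illustration only (hypothetical, not from Salaries.csv)
toy_salaries = pd.DataFrame({
    'age':    [24, 24, 31, 31, 31, 38],
    'salary': [500_000, 700_000, 4_000_000, 6_000_000, 30_000_000, 2_000_000],
})

summary = median_salary_by_age(toy_salaries)
# Age with the highest median salary in the toy data
peak_age = summary.loc[summary['median_salary'].idxmax(), 'age']
print(summary)
print('Peak median salary at age', peak_age)
```

Applied to `salary_age_data`, the same grouping would show whether the median really peaks in the 30 to 35 range and how few rows support the estimates at the oldest ages.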

**Analysis 3:** Home runs and strikeouts vs age

In [None]:
# Merge with the 'People' dataset to get each player's birth year
batting = pd.merge(batting_data, people_data[['playerID', 'birthYear']], on='playerID', how='left')
pitching = pd.merge(pitching_data, people_data[['playerID', 'birthYear']], on='playerID', how='left')

# Calculate the player's age during the season
batting['age'] = batting['yearID'] - batting['birthYear']
pitching['age'] = pitching['yearID'] - pitching['birthYear']


In [None]:
# Aggregate batting metrics by age
batting_by_age = batting.groupby('age').agg({
    'HR': 'sum',       # Total Home Runs
    'RBI': 'sum',      # Total Runs Batted In
}).reset_index()

batting_by_age.head()

In [None]:
# Aggregate pitching metrics by age
pitching_by_age = pitching.groupby('age').agg({
    'SO': 'sum',       # Total Strikeouts
    'ERA': 'mean',     # Average ERA
}).reset_index()

# Check the results
print(pitching_by_age.head())

In [None]:
# Plot Home Runs by Age
sns.lineplot(x='age', y='HR', data=batting_by_age, marker='o')
plt.title('Home Runs by Age')
plt.xlabel('Age')
plt.ylabel('Home Runs')
plt.grid(True)
plt.show()

# Plot Strikeouts by Age
plt.figure()
sns.lineplot(x='age', y='SO', data=pitching_by_age, marker='o')
plt.title('Strikeouts by Age')
plt.xlabel('Age')
plt.ylabel('Strikeouts')
plt.grid(True)
plt.show()

**Home Runs (HR):**
Home runs (HR) count the hits on which the batter rounds all the bases and scores on the same play, usually by hitting the ball out of the park in fair territory. Home run ability is often associated with power hitting.

**Strikeouts (SO):**
Strikeouts (SO) count the batters a pitcher retires on three strikes. They are a critical measure of a pitcher's power, control, and effectiveness.

**Plot Interpretation:**

The line plots of HR by age and SO by age show how total home run production and total pitcher strikeouts change with age. Because each point sums over every player of that age, the totals also reflect how many players are active at that age.

*Early Career (Early 20s):* Home run totals start low as players develop their strength and adjust to major league pitching. Likewise, strikeout totals are low at the start of pitchers' careers and then rise.

*Late 20s to Early 30s:* Players often reach their peak power during these years, leading to the highest home run totals; strikeout totals peak in the same range.

*Decline Phase (Mid 30s and Beyond):* As players age, power tends to decline, leading to fewer home runs and strikeouts. This can be due to decreased physical strength, slower bat speed, and the natural aging process.

**Overall Insights:**

The peak age for home runs and strikeouts shows when a player's power typically reaches its maximum.
The decline in home runs and strikeouts can inform strategies around player usage, such as moving aging sluggers into roles where they can still contribute without needing to play every day.
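
Because the plots sum over every player, an age with many active players shows a high total even if each player hits few home runs. A rough sketch of a per-row view divides the total by the number of batting rows at each age. The `toy_batting_age` frame is made up and `hr_per_row_by_age` is a hypothetical helper used for illustration only.

```python
import pandas as pd

def hr_per_row_by_age(batting_with_age: pd.DataFrame) -> pd.DataFrame:
    """Total home runs, number of batting rows, and home runs per row at each age."""
    by_age = batting_with_age.groupby('age').agg(
        total_hr=('HR', 'sum'),
        rows=('playerID', 'count'),  # one row per player, season, and stint
    ).reset_index()
    by_age['hr_per_row'] = by_age['total_hr'] / by_age['rows']
    return by_age

# Made-up data for illustration only (hypothetical, not from Batting.csv)
toy_batting_age = pd.DataFrame({
    'playerID': ['a', 'b', 'c', 'd', 'e'],
    'age':      [25, 25, 25, 25, 35],
    'HR':       [10, 5, 5, 0, 15],
})

per_row = hr_per_row_by_age(toy_batting_age)
print(per_row)
```

In the toy data, age 25 has the larger total but age 35 has the larger per-row figure, which shows how the two views can disagree.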

**Analysis 4:** Distribution of Age at Debut with Density Plot


In [None]:
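# Keep only each player's first season in the batting table, so the age
# reflects the debut rather than every season the player appeared in
batting_data = batting_data.sort_values('yearID').drop_duplicates('playerID', keep='first')
# Merge the birth year with the batting data to calculate debut age
batting_data = pd.merge(batting_data, people_data[['playerID', 'birthYear']], on='playerID', how='left')
# Calculate the player's age at debut
batting_data['age_at_debut'] = batting_data['yearID'] - batting_data['birthYear']

# Filter out invalid values (valid ages should be between 15 and 40)
batting_data = batting_data[(batting_data['age_at_debut'] > 15) & (batting_data['age_at_debut'] < 40)]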

# Plot histogram with a density plot overlay
plt.figure(figsize=(10, 6))
plt.hist(batting_data['age_at_debut'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7, density=True)
batting_data['age_at_debut'].plot(kind='kde', color='blue', linewidth=2)

# Calculate summary statistics
mean_age = batting_data['age_at_debut'].mean()
median_age = batting_data['age_at_debut'].median()
std_dev_age = batting_data['age_at_debut'].std()

# Add the vertical lines for mean and median
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_age:.2f}')
plt.axvline(median_age, color='green', linestyle='dashed', linewidth=1, label=f'Median: {median_age:.2f}')
plt.axvline(mean_age + std_dev_age, color='blue', linestyle='dashed', linewidth=1, label=f'Mean + 1 Std Dev (Std Dev: {std_dev_age:.2f})')

# Set the x-axis limits to zoom in on valid debut ages
plt.xlim(15, 40)

# Add a legend with a box in the upper right corner
plt.legend(loc='upper right', frameon=True, fontsize=10, title='Statistics')

# Set the title and axis labels
plt.title('Distribution of Age at Debut with Density Plot', fontsize=14)
plt.xlabel('Age at Debut', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.tight_layout()
plt.show()

This graph represents the distribution of ages at which baseball players made their Major League debut. It provides both a histogram of the debut ages and a density plot for smoother interpretation.

**Key Components:**

**Histogram (Light Green Bars):**
* The bars show the density of players debuting at each age; the histogram is normalized so that the bar areas sum to 1.
* Most players make their debut between the ages of 20 and 30, with a clear peak around 24 to 26 years old.
* This is the typical age range for players entering the major leagues after gaining some experience and training.

**Density Plot (Blue Line):**
* The blue line provides a smoothed version of the histogram, making it easier to observe the overall trend.
* The density plot highlights the peak debut age and the gradual decline after the age of 30.

**Vertical Lines:**
* **Red Dashed Line:** Marks the mean debut age across all players; its value appears in the legend.
* **Green Dashed Line:** Marks the median debut age, so half of the players debut before this age and half debut after.
* **Blue Dashed Line:** Sits one standard deviation above the mean. This gives a sense of the spread or variability in debut ages.

**Legend:**
* The legend in the top right explains the vertical lines and their significance. It helps clarify the summary statistics, making the graph easier to interpret.

**Key Insights:**
* **Debut Age Distribution:** The most common debut age is between 24 and 26, which is expected as players usually develop their skills in the minor leagues or college before entering the majors.
* **Peak and Decline:** After the age of 30, the number of debuting players drops significantly, which aligns with expectations since players tend to enter the league earlier in their careers.
* **Variation:** The standard deviation (shown in the legend) indicates a moderate variation in debut ages, but most players debut around their mid-20s.

This graph provides a clear visual summary of how player ages at debut are distributed, with most players entering the major leagues in their mid-20s.
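
Debut age can also be measured from dates rather than from season years. The sketch below assumes that the People table carries `birthMonth`, `birthDay`, and a `debut` date column next to `birthYear`; that is an assumption about the data, not something this notebook confirms. The `toy_people` frame is made up and `debut_age_years` is a hypothetical helper.

```python
import pandas as pd

def debut_age_years(people: pd.DataFrame) -> pd.Series:
    """Age in years at the debut date; NaN where either date is missing or invalid."""
    # Assemble the birth date from its year, month, and day components
    birth = pd.to_datetime(
        people[['birthYear', 'birthMonth', 'birthDay']].rename(
            columns={'birthYear': 'year', 'birthMonth': 'month', 'birthDay': 'day'}
        ),
        errors='coerce',
    )
    # Assumed column: 'debut' holds the date of the first major league game
    debut = pd.to_datetime(people['debut'], errors='coerce')
    return (debut - birth).dt.days / 365.25

# Made-up data for illustration only (hypothetical, not from People.csv)
toy_people = pd.DataFrame({
    'playerID':   ['x', 'y', 'z'],
    'birthYear':  [1990, 1985, 1992],
    'birthMonth': [1, 7, 3],
    'birthDay':   [1, 1, 15],
    'debut':      ['2014-01-01', '2005-07-01', None],
})

ages = debut_age_years(toy_people)
print(ages)
```

Working from dates avoids the up-to-one-year error of subtracting calendar years, and it does not depend on which seasons appear in the batting table.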