## Project: Analyzing Baseball Data
### Author: SANDHYA S
### Date: 12 February '21
### Course: Python Data Analysis

### Description: 
This project works with several provided CSV files that contain data on the performance of Major League Baseball (MLB) players over a span of more than a century. This historical data can be found at seanlahman.com. The project focuses on writing code that computes several common batting statistics from the data in the CSV files.
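
For reference, the three batting statistics computed by the provided formula functions in this notebook can be written as follows. These are the simplified forms used in the code here, not the official definitions (official on-base percentage, for instance, also counts hit-by-pitch and sacrifice flies). The symbols follow the column headers in the test code: $H$ hits, $AB$ at-bats, $BB$ walks, $2B$ doubles, $3B$ triples, $HR$ home runs.

$$
\begin{aligned}
\text{AVG} &= \frac{H}{AB} \\
\text{OBP} &= \frac{H + BB}{AB + BB} \\
\text{SLG} &= \frac{1B + 2 \cdot 2B + 3 \cdot 3B + 4 \cdot HR}{AB}, \qquad 1B = H - 2B - 3B - HR
\end{aligned}
$$

In the code, each statistic is returned as 0 when $AB$ is below `MINIMUM_AB` (500).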

The info dictionaries contain the following keys, all of which are strings:

• masterfile: the name of the master CSV file that includes columns with player IDs and names.

• battingfile: the name of the CSV file that includes columns with player IDs and batting data.

• separator: the delimiter character used in the two CSV files.

• quote: the quote character used in the two CSV files.

• playerid: the name of the column header for player IDs in both the master and batting CSV files.

• firstname: the name of the column header for players' first names in the master CSV file.

• lastname: the name of the column header for players' last names in the master CSV file.

• yearid: the name of the column header for the year in the batting CSV file.

• atbats: the name of the column header for at-bats data in the batting CSV file.

• hits: the name of the column header for hits data in the batting CSV file.

• doubles: the name of the column header for doubles data in the batting CSV file.

• triples: the name of the column header for triples data in the batting CSV file.

• homeruns: the name of the column header for home runs data in the batting CSV file.

• walks: the name of the column header for walks data in the batting CSV file.

• battingfields: a list of column header names that correspond to batting data in the batting CSV file.
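
As a small illustration of how these keys are meant to be used, the sketch below reads values from a batting row through the info dictionary instead of through hard-coded column names. The header names ("playerID", "H", "AB") match the test code later in this notebook; the sample row and the ID "example01" are invented for illustration and do not come from the data files.

```python
# A trimmed-down info dictionary; the header names mirror the test code.
info = {"playerid": "playerID", "hits": "H", "atbats": "AB"}

# Made-up row, shaped like one produced by csv.DictReader (values are strings).
sample_row = {"playerID": "example01", "yearID": "2010", "AB": "600", "H": "180"}

# Column names are never hard-coded: they are read from info.
player = sample_row[info["playerid"]]
hits = int(sample_row[info["hits"]])
at_bats = int(sample_row[info["atbats"]])

print(player, hits / at_bats)
```

Because every column name goes through `info`, the same functions should keep working if a different data file uses different headers, as long as the info dictionary is updated to match.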


#### Part 1: Compute players with top batting statistics by year
The task is to write four functions that can be used in combination to compute the top players based on a provided statistical formula for a given year. These functions will select a subset of the data and compute the provided statistic on this data.
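
The select, score, and rank steps described above can be sketched on a few in-memory rows. This is a simplified illustration, not the notebook's implementation: the rows, the player IDs, and the helper names `hits_per_at_bat` and `top_for_year` are hypothetical, and the final step of looking up player names in the master file is left out.

```python
# Hypothetical in-memory rows standing in for the batting CSV file.
rows = [
    {"playerID": "a01", "yearID": "2010", "AB": "600", "H": "180"},
    {"playerID": "b02", "yearID": "2010", "AB": "550", "H": "176"},
    {"playerID": "c03", "yearID": "2009", "AB": "580", "H": "200"},
]


def hits_per_at_bat(row):
    """Illustrative statistic: hits divided by at-bats."""
    return int(row["H"]) / int(row["AB"])


def top_for_year(rows, year, statistic, count):
    """Return up to count (player ID, statistic) pairs for one year."""
    # Step 1: keep only the rows from the requested year.
    selected = [row for row in rows if row["yearID"] == str(year)]
    # Step 2: compute the statistic for each remaining row.
    scored = [(row["playerID"], statistic(row)) for row in selected]
    # Step 3: sort in decreasing order and keep the first count entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


top_2010 = top_for_year(rows, 2010, hits_per_at_bat, 1)
print(top_2010)  # [('b02', 0.32)]
```

Passing the statistic in as a function is what lets the same ranking code serve batting average, on-base percentage, slugging percentage, or any other formula.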

#### Part 2: Compute players with top batting statistics by career
The task is to write two more functions that can be used along with the other four functions to compute the top players based on a provided statistical formula for their entire career. These functions will aggregate the yearly data that spans each player's career and then compute the provided statistic on this aggregated data.
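
A minimal sketch of the aggregation step, assuming a handful of made-up season rows: the per-season values for each player are summed field by field. The helper name `career_totals` and the sample data are hypothetical. The notebook's own function additionally stores the player ID inside each aggregated dictionary so that the ranking function can read it.

```python
from collections import defaultdict

# Made-up season rows; one dictionary per player per season.
seasons = [
    {"playerID": "a01", "AB": "600", "H": "180"},
    {"playerID": "a01", "AB": "500", "H": "160"},
    {"playerID": "b02", "AB": "550", "H": "176"},
]


def career_totals(seasons, fields):
    """Sum the given fields over all seasons of each player."""
    totals = defaultdict(lambda: defaultdict(int))
    for season in seasons:
        for field in fields:
            # CSV values are strings, so convert before adding.
            totals[season["playerID"]][field] += int(season[field])
    # Convert back to plain dictionaries for easier printing and comparison.
    return {player: dict(stats) for player, stats in totals.items()}


careers = career_totals(seasons, ["AB", "H"])
print(careers["a01"])  # {'AB': 1100, 'H': 340}
```

Once the seasons are collapsed into one dictionary per player, the same formula functions can be applied to career totals exactly as they are applied to a single season.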

"""
Project for Week 4 of "Python Data Analysis".
Processing CSV files with baseball statistics.
"""

import csv

##
## Provided code from Week 3 Project
##

def read_csv_as_list_dict(filename, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a list of dictionaries where each item in the list
      corresponds to a row in the CSV file.  The dictionaries in the
      list map the field names to the field values for that row.
    """
    table = []
    with open(filename, newline='') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            table.append(row)
    return table


def read_csv_as_nested_dict(filename, keyfield, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      keyfield  - field to use as key for rows
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a dictionary of dictionaries where the outer dictionary
      maps the value in the keyfield column to the corresponding row in the
      CSV file.  The inner dictionaries map the field names to the
      field values for that row.
    """
    table = {}
    with open(filename, newline='') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            rowid = row[keyfield]
            table[rowid] = row
    return table

##
## Provided formulas for common batting statistics
##

# Typical cutoff used for official statistics
MINIMUM_AB = 500

def batting_average(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the batting average as a float, or 0 if the player has
      fewer than MINIMUM_AB at-bats
    """
    hits = float(batting_stats[info["hits"]])
    at_bats = float(batting_stats[info["atbats"]])
    if at_bats >= MINIMUM_AB:
        return hits / at_bats
    else:
        return 0

def onbase_percentage(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the on-base percentage as a float, or 0 if the player has
      fewer than MINIMUM_AB at-bats
    """
    hits = float(batting_stats[info["hits"]])
    at_bats = float(batting_stats[info["atbats"]])
    walks = float(batting_stats[info["walks"]])
    if at_bats >= MINIMUM_AB:
        return (hits + walks) / (at_bats + walks)
    else:
        return 0

def slugging_percentage(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the slugging percentage as a float, or 0 if the player has
      fewer than MINIMUM_AB at-bats
    """
    hits = float(batting_stats[info["hits"]])
    doubles = float(batting_stats[info["doubles"]])
    triples = float(batting_stats[info["triples"]])
    home_runs = float(batting_stats[info["homeruns"]])
    singles = hits - doubles - triples - home_runs
    at_bats = float(batting_stats[info["atbats"]])
    if at_bats >= MINIMUM_AB:
        return (singles + 2 * doubles + 3 * triples + 4 * home_runs) / at_bats
    else:
        return 0


##
## Part 1: Functions to compute top batting statistics by year
##

def filter_by_year(statistics, year, yearid):
    """
    Inputs:
      statistics - List of batting statistics dictionaries
      year       - Year to filter by
      yearid     - Year ID field in statistics
    Outputs:
      Returns a list of batting statistics dictionaries that
      are from the input year.
    """
    year_to_str = str(year)

    # CSV values are strings, so compare against the string form of year.
    # dict.get keeps the original behavior of skipping rows without yearid.
    return [data for data in statistics if data.get(yearid) == year_to_str]


def top_player_ids(info, statistics, formula, numplayers):
    """
    Inputs:
      info       - Baseball data information dictionary
      statistics - List of batting statistics dictionaries
      formula    - function that takes an info dictionary and a
                   batting statistics dictionary as input and
                   computes a compound statistic
      numplayers - Number of top players to return
    Outputs:
      Returns a list of tuples, player ID and compound statistic
      computed by formula, of the top numplayers players sorted in
      decreasing order of the computed statistic.
    """
    comp_stat = []

    for player_stat in statistics:
        playerid = player_stat[info['playerid']]
        comp_formula = formula(info, player_stat)
        comp_stat.append((playerid, comp_formula))

    comp_stat.sort(key=lambda player: player[1], reverse=True)

    # Slicing avoids an IndexError when fewer than numplayers rows exist.
    return comp_stat[:numplayers]


def lookup_player_names(info, top_ids_and_stats):
    """
    Inputs:
      info              - Baseball data information dictionary
      top_ids_and_stats - list of tuples containing player IDs and
                          computed statistics
    Outputs:
      List of strings of the form "x.xxx --- FirstName LastName",
      where "x.xxx" is a string conversion of the float stat in
      the input and "FirstName LastName" is the name of the player
      corresponding to the player ID in the input.
    """
    # Map each player ID to its row in the master file for direct lookup.
    master = read_csv_as_nested_dict(info['masterfile'], info['playerid'],
                                     info['separator'], info['quote'])
    player_details = []

    for player, stats in top_ids_and_stats:
        # Players missing from the master file are skipped, as before.
        if player in master:
            row = master[player]
            player_details.append("{:1.3f} --- {} {}".format(
                stats, row[info['firstname']], row[info['lastname']]))

    return player_details


def compute_top_stats_year(info, formula, numplayers, year):
    """
    Inputs:
      info        - Baseball data information dictionary
      formula     - function that takes an info dictionary and a
                    batting statistics dictionary as input and
                    computes a compound statistic
      numplayers  - Number of top players to return
      year        - Year to filter by
    Outputs:
      Returns a list of strings for the top numplayers in the given year
      according to the given formula.
    """
    yearid = info['yearid']
    score_file = info['battingfile']
    separator = info['separator']
    quote = info['quote']
    statistics = read_csv_as_list_dict(score_file, separator, quote)
    filt_stat = filter_by_year(statistics, year, yearid)
    top_stat = top_player_ids(info, filt_stat, formula, numplayers)
    top_players = lookup_player_names(info, top_stat)

    return top_players


##
## Part 2: Functions to compute top batting statistics by career
##

def aggregate_by_player_id(statistics, playerid, fields):
    """
    Inputs:
      statistics - List of batting statistics dictionaries
      playerid   - Player ID field name
      fields     - List of fields to aggregate
    Output:
      Returns a nested dictionary whose keys are player IDs and whose values
      are dictionaries of aggregated stats.  Only the fields from the fields
      input will be aggregated in the aggregated stats dictionaries.
    """
    out_dict = {}
    
    for stat in statistics:
        outer = stat[playerid]
        if outer not in out_dict:
            out_dict[outer] = {}
        out_dict[outer][playerid] = outer
        for field in fields:
            if field not in out_dict[outer]:
                out_dict[outer][field] = 0
            out_dict[outer][field] += int(stat[field])
                
    return out_dict
        


def compute_top_stats_career(info, formula, numplayers):
    """
    Inputs:
      info        - Baseball data information dictionary
      formula     - function that takes an info dictionary and a
                    batting statistics dictionary as input and
                    computes a compound statistic
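      numplayers  - Number of top players to return
    Outputs:
      Returns a list of strings for the top numplayers over their
      entire careers according to the given formula.
    """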
    score_file = info['battingfile']
    fields = info['battingfields']
    separator = info['separator']
    quote = info['quote']
    playerid = info['playerid']
    statistics = read_csv_as_list_dict(score_file, separator, quote)
    aggregate_stat = aggregate_by_player_id(statistics, playerid, fields)
    # Only the aggregated rows are needed; the outer keys repeat the IDs.
    aggregate_stat_list = list(aggregate_stat.values())
    player_stat = top_player_ids(info, aggregate_stat_list, formula, numplayers)
    top_careers = lookup_player_names(info, player_stat)

    return top_careers


##
## Provided testing code
##

def test_baseball_statistics():
    """
    Simple testing code.
    """

    #
    # Dictionary containing information needed to access baseball statistics
    # This information is all tied to the format and contents of the CSV files
    #
    baseballdatainfo = {"masterfile": "Master_2016.csv",   # Name of Master CSV file
                        "battingfile": "Batting_2016.csv", # Name of Batting CSV file
                        "separator": ",",                  # Separator character in CSV files
                        "quote": '"',                      # Quote character in CSV files
                        "playerid": "playerID",            # Player ID field name
                        "firstname": "nameFirst",          # First name field name
                        "lastname": "nameLast",            # Last name field name
                        "yearid": "yearID",                # Year field name
                        "atbats": "AB",                    # At bats field name
                        "hits": "H",                       # Hits field name
                        "doubles": "2B",                   # Doubles field name
                        "triples": "3B",                   # Triples field name
                        "homeruns": "HR",                  # Home runs field name
                        "walks": "BB",                     # Walks field name
                        "battingfields": ["AB", "H", "2B", "3B", "HR", "BB"]}

    print("Top 5 batting averages in 1923")
    top_batting_average_1923 = compute_top_stats_year(baseballdatainfo, batting_average, 5, 1923)
    for player in top_batting_average_1923:
        print(player)
    print("")

    print("Top 10 batting averages in 2010")
    top_batting_average_2010 = compute_top_stats_year(baseballdatainfo, batting_average, 10, 2010)
    for player in top_batting_average_2010:
        print(player)
    print("")

    print("Top 10 on-base percentage in 2010")
    top_onbase_2010 = compute_top_stats_year(baseballdatainfo, onbase_percentage, 10, 2010)
    for player in top_onbase_2010:
        print(player)
    print("")

    print("Top 10 slugging percentage in 2010")
    top_slugging_2010 = compute_top_stats_year(baseballdatainfo, slugging_percentage, 10, 2010)
    for player in top_slugging_2010:
        print(player)
    print("")

    # You can also use lambdas for the formula.
    # This one computes on-base plus slugging percentage (OPS).
    print("Top 10 OPS in 2010")
    top_ops_2010 = compute_top_stats_year(
        baseballdatainfo,
        lambda info, stats: (onbase_percentage(info, stats) +
                             slugging_percentage(info, stats)),
        10, 2010)
    for player in top_ops_2010:
        print(player)
    print("")

    print("Top 20 career batting averages")
    top_batting_average_career = compute_top_stats_career(baseballdatainfo, batting_average, 20)
    for player in top_batting_average_career:
        print(player)
    print("")

    
test_baseball_statistics()

Top 5 batting averages in 1923
0.403 --- Harry Heilmann
0.393 --- Babe Ruth
0.380 --- Tris Speaker
0.371 --- Jim Bottomley
0.360 --- Eddie Collins

Top 10 batting averages in 2010
0.359 --- Josh Hamilton
0.336 --- Carlos Gonzalez
0.328 --- Miguel Cabrera
0.327 --- Joe Mauer
0.324 --- Joey Votto
0.321 --- Adrian Beltre
0.319 --- Robinson Cano
0.318 --- Billy Butler
0.315 --- Ichiro Suzuki
0.312 --- Matt Holliday

Top 10 on-base percentage in 2010
0.422 --- Miguel Cabrera
0.420 --- Joey Votto
0.414 --- Albert Pujols
0.408 --- Josh Hamilton
0.403 --- Joe Mauer
0.393 --- Daric Barton
0.393 --- Adrian Gonzalez
0.392 --- Paul Konerko
0.392 --- Shin-Soo Choo
0.389 --- Billy Butler

Top 10 slugging percentage in 2010
0.633 --- Josh Hamilton
0.622 --- Miguel Cabrera
0.617 --- Jose Bautista
0.600 --- Joey Votto
0.598 --- Carlos Gonzalez
0.596 --- Albert Pujols
0.584 --- Paul Konerko
0.553 --- Adrian Beltre
0.536 --- Adam Dunn
0.534 --- Robinson Cano

Top 10 OPS in 2010
1.045 --- Miguel Cabrera
1

## Thank You!