# MIS 501 Python Football Analysis Project
In this project you will use the skills and concepts we discussed this semester to ingest, manipulate, analyze, and report data using Python.

Some of the more helpful concepts you could use to complete this notebook:
* basic syntax, len() function, variables
* conditionals
* looping
* data structures: lists, dictionaries, and sets
* pandas
* regex - this is helpful to get text patterns
* JSON - reading and writing JSON files
* Pathlib for accessing the files

You have been provided a set of JSON files describing football games from the 2017 season. The files may or may not include all the games from that season. If a statistic in the provided data conflicts with *actual* real world data, the correct answer is in the *provided* data. 

Use only the JSON files contained in the 'Full' folders (not 'Flattended').

The objective of this project is to answer the set of questions below. The output of your project is a JSON file containing the question (key) and the answer (value). The keys must be in the format qn and the answer is a value appropriate for the question.

The 'season' includes all games provided (including bowl games).

In [1]:
import pathlib
import json
import pprint as pp
import pandas as pd
import webbrowser as wb
import numpy as np
import collections as cl

In [2]:
# Example of how to answer a question
answer_file = {} # create blank dictionary

> You must name the file 'mis501_football_part1_*netid*.json', for example, mis501_football_part1_gjbott.json.

In [3]:
data_filepath = pathlib.Path(r'2017 Alabama football JSON')

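#rglob returns a generator, so build a list from it
#rglob already searches every subfolder; keep only JSON files whose parent folder is named 'full' (any letter case)
all_files = [f for f in data_filepath.rglob('*.json') if f.parent.name.lower() == 'full']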

all_files_list = []
#append all games into a list
for file in all_files:
    all_files_list.append(file.name)
    
data_dict = {}
#loading each game_file into a dictionary to be used in future questions
for game_file in all_files:
    with open(game_file, 'r') as fin:
        data_dict[game_file.name] = json.load(fin)

## Question 1 
Explain steps that you took to ensure the data was of good quality. What errors or issues did you find in the data?

In [4]:
text_q1 = """I needed to use rglob to grab only the files within the full directory and not the flattened directory.
 Because rglob returns a generator, I wrapped the results in a list.
 I did not see problems with the data until working on question 6. I began by indexing through scoringPlays, however the Texas St. vs Georgia St. game in week 11 does not have that key.
 I made a copy of the dictionary without this specific game to bypass it. This revealed another problem: two-point conversions were recorded as a "safety" in scoringPlays.
 So I then indexed through the drives and used the drive result instead, which worked.
 In question 9, filtering through plays found 19 plays recorded at 100 yards or more. One of these results has a penalty for 10907 yards (not possible) and another recorded 102 yards gained, again not possible.
 So I set the filter equal to 100 to eliminate these results.
 Finally, for question 12 I found that the shortest pass reception was -66, which seems unrealistic. After finding that JSON file and comparing it to the ESPN gamecast for that game, I assume this is a data entry error.
 The second and third shortest pass receptions seem uncommon but not unrealistic, so I assume they are possible."""

answer_file['q1'] = text_q1

## Question 2
How many games are in the data set?

In [5]:
#Taking the length of the game list will give us the total games in the data set
games_q2 = len(all_files_list)
answer_file['q2'] = games_q2

## Question 3
What are the topmost keys in the JSON file provided for this assignment?

In [6]:
#Using one game_file, we can check the keys for the majority of game_files
#As I worked through each question, I realized some game_files have certain keys missing
keys_q3 = data_dict['400953322 - North Texas vs Troy.json'].keys()
answer_file['q3'] = list(keys_q3)
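
Since some game files are missing keys, one way to check how representative a single file is would be to count how often each top-level key appears across all games. The following is a minimal sketch, not part of the submitted answer; `key_coverage` and `sample_games` are hypothetical names, and the sample dictionary is a made-up stand-in for `data_dict`.

```python
from collections import Counter

def key_coverage(games):
    """Count how many game dictionaries contain each top-level key."""
    counts = Counter()
    for game in games.values():
        counts.update(game.keys())
    return counts

# Hypothetical miniature stand-in for data_dict (not the real files)
sample_games = {
    'game_a.json': {'teams': [], 'drives': {}, 'scoringPlays': []},
    'game_b.json': {'teams': [], 'drives': {}},
}

coverage = key_coverage(sample_games)

# Keys that are absent from at least one game
missing_somewhere = sorted(k for k, n in coverage.items() if n < len(sample_games))
print(missing_somewhere)  # → ['scoringPlays']
```

Calling `key_coverage(data_dict)` in the notebook would show which keys are safe to index directly and which need `.get`.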

## Question 4
Provide a Python list of all the teams represented in the files, sorted alphabetically.

In [7]:
all_teams_list = []

#we need to loop through each game's values to find the home and away teams for each
for game in data_dict.values():

    #we index into each game several times to find each team's displayName
    home_team = game['teams'][0]['team']['displayName']
    away_team = game['teams'][1]['team']['displayName']

    #only append a team that is not already inside the list of teams
    if home_team not in all_teams_list:
        all_teams_list.append(home_team)

    if away_team not in all_teams_list:
        all_teams_list.append(away_team)

total_teams_q4 = sorted(all_teams_list)
answer_file['q4'] = total_teams_q4
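
The membership checks above can also be expressed with a set, which discards duplicates automatically. This is only a sketch of that alternative; `unique_team_names` and the sample data are hypothetical, and the structure assumed is the `teams` → `team` → `displayName` nesting used in the cell above.

```python
def unique_team_names(games):
    """Return an alphabetically sorted list of every team name seen in the games."""
    names = set()
    for game in games.values():
        for entry in game['teams']:
            names.add(entry['team']['displayName'])
    return sorted(names)

# Hypothetical stand-in for data_dict with one team appearing twice
sample_games = {
    'game_1.json': {'teams': [{'team': {'displayName': 'Team B'}},
                              {'team': {'displayName': 'Team A'}}]},
    'game_2.json': {'teams': [{'team': {'displayName': 'Team C'}},
                              {'team': {'displayName': 'Team B'}}]},
}

teams = unique_team_names(sample_games)
print(teams)       # → ['Team A', 'Team B', 'Team C']
print(len(teams))  # → 3
```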

## Question 5
How many unique teams are represented in the data?

In [8]:
#By finding the length of the teams list we can output the total unique teams
answer_file['q5'] = len(all_teams_list)

## Question 6
A *safety* in football refers to when the offensive player who has possession of the football is tackled or willingly downs the ball in their end zone. Two points are awarded to the defensive team. The offensive team loses possession of the ball.

In how many games did a safety occur?

In [9]:
#List to store each unique safety within a game
safety_list = []

#List to store each team that forced a safety
total_safeties = []

#List to store each team that gave up a safety
gave_up_safety = []


#We loop through games but also have to loop through each drive to find the drive result that ended with a safety
for game in data_dict.values():
    for drive in game['drives']['previous'][:]:
        if drive.get('displayResult') == 'Safety':
            
            home_team = game['teams'][0]['team']['displayName']
            away_team = game['teams'][1]['team']['displayName']
            
#after storing each team into a variable, we can check to see which team had the ball on that drive and either store the teams into the forced or gave up safety lists
            if drive['team']['displayName'] == home_team:
                team_scored = away_team
                gave_up_safety.append(home_team)
            else:
                team_scored = home_team
                gave_up_safety.append(away_team)
            
#By using an f string, we can store each the matchup and forced safety team to check if more than one safety happened in a single game
            safety = f'{away_team} @ {home_team} - {team_scored}'
           
            
            safety_list.append(safety)
            total_safeties.append(team_scored)
            
#By wrapping the list of games with a safety into a set and taking the length we can find the total games with a safety
#Without the set, the output would be 2 higher because 2 games had multiple safeties
games_with_safety = len(set(safety_list))
answer_file['q6'] = games_with_safety
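
One caveat with the approach above: the set is keyed on the matchup plus the scoring team, so a game in which both teams recorded a safety would count twice, if such a game exists in the data. A sketch of an alternative that keys on the game itself is shown below; `games_with_result` and the sample data are hypothetical, and the `drives` → `previous` → `displayResult` layout is assumed from the cell above.

```python
def games_with_result(games, result='Safety'):
    """Return the set of game keys in which at least one drive ended with `result`."""
    matching = set()
    for key, game in games.items():
        drives = game.get('drives', {}).get('previous', [])
        if any(drive.get('displayResult') == result for drive in drives):
            matching.add(key)
    return matching

# Hypothetical stand-in for data_dict: game_1 has a safety by each team
sample_games = {
    'game_1.json': {'drives': {'previous': [
        {'displayResult': 'Safety', 'team': {'displayName': 'Team A'}},
        {'displayResult': 'Punt', 'team': {'displayName': 'Team B'}},
        {'displayResult': 'Safety', 'team': {'displayName': 'Team B'}},
    ]}},
    'game_2.json': {'drives': {'previous': [
        {'displayResult': 'Touchdown', 'team': {'displayName': 'Team C'}},
    ]}},
    'game_3.json': {'drives': {'previous': [
        {'displayResult': 'Safety', 'team': {'displayName': 'Team D'}},
    ]}},
}

safety_games = games_with_result(sample_games)
print(len(safety_games))  # → 2
```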

## Question 7
Which teams scored the most safeties?

In [10]:
#Using the collections module's Counter we can count the total safeties for each team and sort them in descending order
teams_scored_safety = cl.Counter(total_safeties).most_common()

#Slicing the first 5 entries gives us the teams with the most safeties
top_scored_safety_teams = teams_scored_safety[0:5]

answer_file['q7'] = top_scored_safety_teams
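
The slice length above was chosen by inspecting the counts. A sketch of a way to select every team tied for the top count without hard-coding a number follows; `leaders` is a hypothetical helper and the sample counts are invented for illustration.

```python
from collections import Counter

def leaders(counter):
    """Return every (name, count) pair tied for the highest count."""
    if not counter:
        return []
    top = max(counter.values())
    return sorted((name, n) for name, n in counter.items() if n == top)

# Hypothetical list of teams credited with a safety
sample = Counter(['Team A', 'Team B', 'Team A', 'Team C', 'Team B'])
print(leaders(sample))  # → [('Team A', 2), ('Team B', 2)]
```

The same helper could be applied to the counter of teams that gave up safeties.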

## Question 8
Which teams gave up the most safeties?

In [11]:
#Using the collections module's Counter we can count the safeties each team gave up and sort them in descending order
teams_gave_up_safety = cl.Counter(gave_up_safety).most_common()

#Slicing the first 4 entries gives us the teams that gave up the most safeties
top_teams_gave_up_safety = teams_gave_up_safety[0:4]

answer_file['q8'] = top_teams_gave_up_safety

## Question 9
Find the longest play for the 2017 season. (Ex. a 99 yard interception return) If there are several
of the same length, show them all. Show team matchup, quarter, clocktime, and play text for each of the plays.

In [12]:
longest_plays = {}

#We loop through games but also have to loop through each drive and then again through each play to find the ones that had 100 yards gained
#100 yards is assumed to be the maximum one play can attribute
for game in data_dict.values():
    for drive in game['drives']['previous'][:]:
        for play in drive['plays'][:]:
            if play['statYardage'] == 100:
                
                yards_gained = play['statYardage']
                home_team = game['teams'][0]['team']['displayName']
                away_team = game['teams'][1]['team']['displayName']
                matchup = f'{away_team} @ {home_team}'
                quarter = play['period']['number']
                clocktime = play['clock']['displayValue']
                play_text = play['text']

#Each play value that we found is stored within a key for the game it was in and then added to longest_play dictionary
                each_play = {
                    'yards_gained': yards_gained,
                    'quarter': quarter,
                    'clocktime': clocktime,
                    'play_text': play_text,
                }
                
                if matchup in longest_plays:
                    longest_plays[matchup].append(each_play)
                else:
                    longest_plays[matchup] = [each_play]
                
                
answer_file['q9'] = longest_plays
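
The cell above hard-codes 100 yards as the longest possible play. A sketch of a more general approach is to discard implausible values first and then take the maximum of what remains, keeping all ties. `longest_plays_within`, its `max_plausible` parameter, and the sample plays are hypothetical; only the `statYardage` key comes from the data.

```python
def longest_plays_within(plays, max_plausible=100):
    """Return (length, plays) for the longest plays not exceeding max_plausible yards."""
    plausible = [p for p in plays
                 if 'statYardage' in p and p['statYardage'] <= max_plausible]
    if not plausible:
        return None, []
    longest = max(p['statYardage'] for p in plausible)
    return longest, [p for p in plausible if p['statYardage'] == longest]

# Hypothetical plays, including one impossible yardage value
sample_plays = [
    {'statYardage': 99, 'text': 'kickoff return'},
    {'statYardage': 10907, 'text': 'penalty (bad data)'},
    {'statYardage': 99, 'text': 'interception return'},
    {'statYardage': 45, 'text': 'pass complete'},
]

length, plays = longest_plays_within(sample_plays)
print(length)      # → 99
print(len(plays))  # → 2
```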


## Question 10
How long were Alabama's FIRST and LAST offensive plays of the season? Show the date, opponent, yards gained/lost on the play, and provide the description of the play.

In [13]:
bama_first_and_last_play = {}

#I found the first game_file for Alabama in Week 1 and indexed the date, opponent, yards, and text of the first play
first_game = data_dict['400933827 - Alabama vs Florida State.json']
first_play = first_game['drives']['previous'][0]['plays'][1]
fdate = first_game['competitions'][0]['date']
fopponent = first_game['teams'][1]['team']['displayName']
fyards = first_play['statYardage']
fdescription = first_play['text']

#Added these values under the "FIRST" key in a dictionary
bama_first_and_last_play['FIRST'] = {'date': fdate, 'opponent': fopponent, 'yards': f'{fyards} gained', 'description': fdescription}

#I found the last game_file for Alabama in Bowls and indexed the date, opponent, yards, and text of the last play
last_game = data_dict['400953415 - Georgia vs Alabama.json']
last_play = last_game['drives']['previous'][28]['plays'][1]
ldate = last_game['competitions'][0]['date']
lopponent = last_game['teams'][1]['team']['displayName']
lyards = last_play['statYardage']
ldescription = last_play['text']

#Added these values under the "LAST" key in a dictionary
bama_first_and_last_play['LAST'] = {'date': ldate, 'opponent': lopponent, 'yards': f'{lyards} gained', 'description': ldescription}

answer_file['q10'] = bama_first_and_last_play
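
The cell above reads the opponent from a fixed position in `teams` and always labels the yardage as "gained". A sketch of two small helpers that avoid both assumptions is shown below; `opponent_of`, `describe_yards`, and the sample game are hypothetical, and the `teams` → `team` → `displayName` nesting is assumed from the earlier cells.

```python
def opponent_of(game, team_name):
    """Return the display name of the team in `game` that is not `team_name`."""
    names = [entry['team']['displayName'] for entry in game['teams']]
    others = [name for name in names if name != team_name]
    return others[0] if others else None

def describe_yards(yards):
    """Label a yardage value as gained, lost, or no gain."""
    if yards > 0:
        return f'{yards} gained'
    if yards < 0:
        return f'{abs(yards)} lost'
    return 'no gain'

# Hypothetical game record; the position of each team in the list is not relied on
sample_game = {'teams': [{'team': {'displayName': 'Georgia Bulldogs'}},
                         {'team': {'displayName': 'Alabama Crimson Tide'}}]}

print(opponent_of(sample_game, 'Alabama Crimson Tide'))  # → Georgia Bulldogs
print(describe_yards(-3))                                # → 3 lost
```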


            

## Question 11
How many times did Alabama punt all season?

In [14]:
season_punts = []

#We can loop through each game and drive to find the results where Alabama punted in each drive of the season
for game in data_dict.values():
    for drive in game['drives']['previous'][:]:
        if drive.get('displayResult') == 'Punt':
            if drive['team']['displayName'] == 'Alabama Crimson Tide':
                
#We can store each punt by the week it happened to make sure we have reasonable numbers in a list
                punt = drive['displayResult']
                week = game['week']
                play_punt = f'{week} - {punt}'
                
                season_punts.append(play_punt)

answer_file['q11'] = len(season_punts)

## Question 12
What was the longest, shortest, and median pass distance? Provide the labels (longest, shortest, and median) as a dict in your answer.

In [15]:
pass_distance = {}
pass_play_yardages = []

#We can loop through each play in each drive of every game to find where the type of play was a "Pass Reception"
for game in data_dict.values():
    for drive in game['drives']['previous'][:]:
        for play in drive['plays'][:]:

            #We use .get here to skip any play that does not have a key for "type"
            play_type = play.get('type')
            if play_type is not None and play_type['text'] == 'Pass Reception':

                #I stored each reception into a list using the "statYardage" value
                yards = play['statYardage']

                pass_play_yardages.append(yards)

#We can take this list and use numpy to find the max, min, and median of the list
longest = int(np.max(pass_play_yardages))
shortest = int(np.min(pass_play_yardages))
median = int(np.median(pass_play_yardages))

#The variables above give us the longest, shortest and median pass distance
pass_distance['longest'] = longest
pass_distance['shortest'] = shortest
pass_distance['median'] = median

answer_file['q12'] = pass_distance


In [16]:
#I have left this here to show how much of an outlier our shortest pass is
#After scanning ESPN and the game's json file, it can be assumed this was a data entry error
sorted_plays = sorted(pass_play_yardages)
sorted_plays[0:10]

[-66, -17, -14, -14, -14, -14, -14, -13, -13, -13]
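
To see how much a single suspect value affects the summary, the statistics can be computed with and without a lower bound. This is a sketch only: `summarize_passes` is a hypothetical helper, the `-30` cutoff and the sample yardages are arbitrary illustrations rather than values taken from the data, and the standard-library `statistics.median` stands in for `np.median`. Note that wrapping a median in `int()` truncates any fractional part.

```python
import statistics

def summarize_passes(yardages, min_plausible=None):
    """Return longest, shortest, and median, optionally ignoring values below a cutoff."""
    values = [y for y in yardages if min_plausible is None or y >= min_plausible]
    return {
        'longest': max(values),
        'shortest': min(values),
        'median': statistics.median(values),
    }

# Hypothetical yardages containing one suspicious value
sample = [-66, -17, -14, 3, 8, 12, 25, 40]

all_values = summarize_passes(sample)
filtered = summarize_passes(sample, min_plausible=-30)
print(all_values)  # → {'longest': 40, 'shortest': -66, 'median': 5.5}
print(filtered)    # → {'longest': 40, 'shortest': -17, 'median': 8}
```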

## Bonus (5 points)
Locate and play a video highlight from one of the games.

In [19]:
#We can index within a specific game and find the url to the video
video = data_dict['400953322 - North Texas vs Troy.json']['videos'][1]['links']['web']['href']

#Using webbrowser we can use the open function to play the url video
#wb.open returns True if a browser could be launched, and that value is stored as the answer
answer_file['Bonus'] = wb.open(video)

## JSON File 
Dump answer dictionary into a json file.

In [21]:
with open(r'mis501_football_part1_sawalker.json', 'w') as fout:
    json.dump(answer_file, fout)
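
A quick way to confirm the file was written correctly is to read it back and compare it with the dictionary in memory. The sketch below uses a temporary directory, a made-up file name, and a small hypothetical `answers` dictionary rather than the real `answer_file`; `indent=2` is optional and only makes the file easier to read.

```python
import json
import pathlib
import tempfile

# Hypothetical answers; dict keys are converted to a list because
# a dict_keys object cannot be serialized to JSON
answers = {'q2': 3, 'q3': list({'teams': 1, 'drives': 2}.keys())}

with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp) / 'answers_example.json'

    # Write the answers, then load them again from disk
    with open(out_path, 'w') as fout:
        json.dump(answers, fout, indent=2)
    with open(out_path) as fin:
        reloaded = json.load(fin)

print(reloaded == answers)  # → True
```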