# <center> Dataset Links </center>

The following links are a good starting point for finding problems and ideas, across a broad range of topics, for your Capstone Project:

1.  [Kaggle](https://www.kaggle.com/) (One of the most popular places for downloading and exploring a broad range of datasets)
2.  [FiveThirtyEight](https://data.fivethirtyeight.com/)
3.  [Google Dataset Search](https://toolbox.google.com/datasetsearch)
4.  [Data Science Dojo](https://blog.datasciencedojo.com/30-datasets-to-uplift-your-skills-in-data-science/) (The Intermediate or Advanced datasets are recommended here)
5.  [DrivenData](https://www.drivendata.org/competitions/)
6.  [Data.World](https://data.world/)
7.  [Challenge.gov](https://www.challenge.gov)
8.  [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
9.  [UFL Misc Datasets](http://www.stat.ufl.edu/~winner/datasets.html)
10. [Datasets Posted on Reddit](https://www.reddit.com/r/datasets/)
11. [San Antonio River Authority Datasets](http://exploresara-sara-tx.opendata.arcgis.com/)
12. [CI Now Open Data Resources](http://nowdata.cinow.info/indicators/local-open-data-portals)
13. [GitHub Link with Broad Range of Datasets](https://github.com/brohrer/academic_advisory/blob/master/use_cases.md)
14. [OpenAire Datasets](https://explore.openaire.eu/)
15. [IBM Data Asset Exchange](https://developer.ibm.com/exchanges/data/)
16. [Another GitHub Link with Broad Range of Datasets](https://github.com/awesomedata/awesome-public-datasets)
17. [World Bank Open Data](https://data.worldbank.org/)
18. [International Monetary Fund Data](https://www.imf.org/en/Data)
19. [National Transit Data](https://www.transit.dot.gov/ntd/ntd-data)
20. [Georgia Governor's Office of Student Achievement Data](https://gosa.georgia.gov/downloadable-data)
21. [USAspending.gov](https://www.usaspending.gov/#/download_center/custom_award_data)
22. [Gapminder](https://www.gapminder.org/data/)


## <center>Sports Related Data</center>

### Sports Reference API

The [Sports Reference API](https://pypi.org/project/sportsreference/) is a Python package that gives you access to data for several sports, including MLB, NHL, NBA, NFL, NCAA Men's Basketball and NCAA Football. It also exposes the sport-specific statistics that [Sports Reference](https://www.sports-reference.com) publishes after each game.

### NFL Data

1.  [NFL Data from the nflscrapR Package](https://github.com/ryurko/nflscrapR-data)  

    This repo contains NFL play-by-play (PBP) data, which can also be downloaded from [Kaggle](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), and games data from 2009-2018. The repo owner updates the PBP data sporadically throughout the current season, so if you want 2019 data, go to the repo rather than Kaggle. One way to load the data from the GitHub repo is the following:

In [1]:
import pandas as pd

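# Raw CSV of the 2019 regular-season play-by-play data in the nflscrapR-data repo
url = 'https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2019.csv'
# low_memory=False reads the whole file at once so column types are inferred consistently
reg_pbp_2019 = pd.read_csv(url, low_memory=False)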
reg_pbp_2019.head()

Unnamed: 0,play_id,game_id,home_team,away_team,posteam,posteam_type,defteam,side_of_field,yardline_100,game_date,...,penalty_player_id,penalty_player_name,penalty_yards,replay_or_challenge,replay_or_challenge_result,penalty_type,defensive_two_point_attempt,defensive_two_point_conv,defensive_extra_point_attempt,defensive_extra_point_conv
0,35,2019090500,CHI,GB,GB,away,CHI,CHI,35.0,2019-09-05,...,,,,0,,,0.0,0.0,0.0,0.0
1,50,2019090500,CHI,GB,GB,away,CHI,GB,75.0,2019-09-05,...,,,,0,,,0.0,0.0,0.0,0.0
2,71,2019090500,CHI,GB,GB,away,CHI,GB,75.0,2019-09-05,...,,,,0,,,0.0,0.0,0.0,0.0
3,95,2019090500,CHI,GB,GB,away,CHI,GB,75.0,2019-09-05,...,,,,0,,,0.0,0.0,0.0,0.0
4,125,2019090500,CHI,GB,GB,away,CHI,GB,85.0,2019-09-05,...,,,,0,,,0.0,0.0,0.0,0.0


2.  [nflscrapR package](https://github.com/maksimhorowitz/nflscrapR)

    This package lets you pull current-season data yourself instead of waiting for the repo owner to update it. It requires installing R and RStudio and following the directions in the nflscrapR documentation. Once the packages are installed, the R code is simple: load the nflscrapR library, call the PBP function, and save the data to a CSV for analysis in Python. Below are the three lines of code needed, which can be run in either RStudio or an R kernel in Jupyter if you have one.

In [None]:
library(nflscrapR)

# The following will take several minutes to run
pbp_2019 <- season_play_by_play(2019)

write.csv(pbp_2019, "./data/reg_pbp_2019.csv")

3.  [NFL Big Data Bowl Kaggle Competition](https://www.kaggle.com/c/nfl-big-data-bowl-2020)

    See if you can predict how many yards a team will gain on given rushing plays as they happen (or whatever else you can think of predicting with the given data).

4.  [NFLGame Python Package](http://nflgame.derekadair.com/)

    A Python package for getting NFL Data.
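The 2019 download shown in item 1 can be generalized to other seasons. The sketch below assumes that the path pattern visible in that URL (`play_by_play_data/regular_season/reg_pbp_<season>.csv`) holds for every season from 2009 onward; `pbp_url` is a hypothetical helper name, not part of nflscrapR.

```python
# Root of the raw files in the nflscrapR-data repo (taken from the URL in item 1)
BASE = "https://raw.githubusercontent.com/ryurko/nflscrapR-data/master"


def pbp_url(season: int) -> str:
    """Build the raw CSV URL for one regular season (assumed file layout)."""
    if season < 2009:
        raise ValueError("the repo's play-by-play data starts in 2009")
    return f"{BASE}/play_by_play_data/regular_season/reg_pbp_{season}.csv"


# One URL per season, 2009 through 2019
urls = {season: pbp_url(season) for season in range(2009, 2020)}
print(urls[2019])
```

Each URL can then be passed to `pd.read_csv(url, low_memory=False)` as in the cell under item 1, and the resulting frames combined with `pd.concat` if a multi-season table is needed.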

### College Football Data

[CollegeFootballData](https://api.collegefootballdata.com/api/docs/?url=/api-docs.json)

An API for accessing college football data. The following examples fetch drive data (by season), game data (by season) and play-by-play data (by season and week) for the 2019 season.

In [2]:
import pandas as pd
import requests

year = 2019

# Base URL of the College Football Data API
base_url = 'https://api.collegefootballdata.com'

# Drive data for the whole regular season
drive_data = pd.DataFrame(requests.get(f'{base_url}/drives?seasonType=regular&year={year}').json())

# Game data for the whole regular season
game_data = pd.DataFrame(requests.get(f'{base_url}/games?year={year}&seasonType=regular').json())

# PBP data is requested one week at a time (week 1 here); loop over weeks to get more
plays_data = pd.DataFrame(requests.get(f'{base_url}/plays?seasonType=regular&year={year}&week=1').json())

In [3]:
drive_data.head()

Unnamed: 0,offense,offense_conference,defense,defense_conference,game_id,id,scoring,start_period,start_yardline,start_time,end_period,end_yardline,end_time,elapsed,plays,yards,drive_result
0,Alabama,SEC,Duke,ACC,401110720,4011107201,False,1,22,"{'minutes': 15, 'seconds': 0}",1,19,"{'minutes': 12, 'seconds': 57}","{'minutes': 2, 'seconds': 3}",3,-3,PUNT
1,Duke,ACC,Alabama,SEC,401110720,4011107202,False,1,55,"{'minutes': 12, 'seconds': 57}",1,52,"{'minutes': 11, 'seconds': 49}","{'minutes': 1, 'seconds': 8}",3,3,PUNT
2,Alabama,SEC,Duke,ACC,401110720,4011107203,False,1,17,"{'minutes': 11, 'seconds': 49}",1,23,"{'minutes': 11, 'seconds': 24}",{'seconds': 25},2,9,FUMBLE
3,Duke,ACC,Alabama,SEC,401110720,4011107204,False,1,26,"{'minutes': 11, 'seconds': 24}",1,7,"{'minutes': 8, 'seconds': 28}","{'minutes': 2, 'seconds': 56}",7,19,DOWNS
4,Alabama,SEC,Duke,ACC,401110720,4011107205,False,1,7,"{'minutes': 8, 'seconds': 28}",1,69,"{'minutes': 3, 'seconds': 25}","{'minutes': 5, 'seconds': 3}",12,62,MISSED FG
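
The `start_time`, `end_time` and `elapsed` columns above hold dictionaries rather than numbers, and a key can be missing (the `elapsed` value in row 2 has only `seconds`). Below is a minimal sketch of flattening one such value to total seconds, assuming a missing key means zero; `clock_to_seconds` is a hypothetical helper name.

```python
def clock_to_seconds(clock: dict) -> int:
    """Convert {'minutes': m, 'seconds': s} to total seconds.

    Assumption: a missing key (e.g. {'seconds': 25}) stands for zero.
    """
    return 60 * clock.get("minutes", 0) + clock.get("seconds", 0)


# Values copied from the drive_data rows shown above
print(clock_to_seconds({"minutes": 2, "seconds": 3}))  # 123
print(clock_to_seconds({"seconds": 25}))               # 25
```

Applied to a whole column, this would look like `drive_data['elapsed'].apply(clock_to_seconds)`.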


In [4]:
game_data.head()

Unnamed: 0,id,season,week,season_type,start_date,neutral_site,conference_game,attendance,venue_id,venue,home_team,home_conference,home_points,home_line_scores,away_team,away_conference,away_points,away_line_scores
0,401110723,2019,1,regular,2019-08-24T23:00:00.000Z,True,False,,4013,Camping World Stadium,Miami,ACC,20.0,"[3, 10, 0, 7]",Florida,SEC,24.0,"[7, 0, 10, 7]"
1,401114164,2019,1,regular,2019-08-25T02:30:00.000Z,False,False,,3610,Aloha Stadium,Hawai'i,Mountain West,45.0,"[14, 14, 7, 10]",Arizona,Pac-12,38.0,"[0, 21, 14, 3]"
2,401119254,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3700,Doyt Perry Stadium,Bowling Green,Mid-American,46.0,"[13, 17, 7, 9]",Morgan State,,3.0,"[0, 3, 0, 0]"
3,401119255,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3965,UB Stadium,Buffalo,Mid-American,38.0,"[21, 7, 10, 0]",Robert Morris,,10.0,"[7, 3, 0, 0]"
4,401117855,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3892,Rentschler Field,Connecticut,American Athletic,24.0,"[7, 3, 14, 0]",Wagner,,21.0,"[0, 0, 14, 7]"


In [5]:
plays_data.head()

Unnamed: 0,id,offense,offense_conference,defense,defense_conference,home,away,offense_score,defense_score,drive_id,period,clock,yard_line,down,distance,yards_gained,play_type,play_text
0,401110720101866901,Alabama,SEC,Duke,ACC,Duke,Alabama,0,0,4011107201,1,"{'minutes': 13, 'seconds': 30}",25,3,7,-6,Sack,Tua Tagovailoa sacked by Koby Quansah for a lo...
1,401110720101858401,Alabama,SEC,Duke,ACC,Duke,Alabama,0,0,4011107201,1,"{'minutes': 14, 'seconds': 15}",23,2,9,2,Pass Reception,Tua Tagovailoa pass complete to Jerome Ford fo...
2,401110720101855301,Alabama,SEC,Duke,ACC,Duke,Alabama,0,0,4011107201,1,"{'minutes': 14, 'seconds': 46}",22,1,10,1,Rush,Jerome Ford run for 1 yd to the Alab 23
3,401110720101874201,Alabama,SEC,Duke,ACC,Duke,Alabama,0,0,4011107201,1,"{'minutes': 12, 'seconds': 57}",19,4,13,3,Punt,"Will Reichard punt for 39 yds , Josh Blackwell..."
4,401110720101849902,Duke,ACC,Alabama,SEC,Duke,Alabama,0,0,4011107201,1,"{'minutes': 15, 'seconds': 0}",65,1,10,22,Kickoff,"AJ Reed kickoff for 65 yds , Henry Ruggs III r..."
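
The plays request above covers only week 1. A rough sketch of collecting several weeks follows; `plays_url` and `fetch_plays` are hypothetical helpers, the added `week` field is an assumption made because the columns shown above do not include one, and the fetching function is passed in as an argument so the loop can be tried without network access.

```python
from urllib.parse import urlencode

PLAYS_ENDPOINT = "https://api.collegefootballdata.com/plays"


def plays_url(year: int, week: int) -> str:
    """Build the play-by-play URL for one regular-season week."""
    query = urlencode({"seasonType": "regular", "year": year, "week": week})
    return f"{PLAYS_ENDPOINT}?{query}"


def fetch_plays(year, weeks, fetch_json):
    """Collect plays for several weeks, tagging each play with its week.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON list, e.g. lambda url: requests.get(url).json()
    """
    plays = []
    for week in weeks:
        for play in fetch_json(plays_url(year, week)):
            plays.append({**play, "week": week})
    return plays


# Offline demonstration with a stand-in fetcher (returns one fake play per week)
def fake_fetch(url):
    return [{"play_type": "Rush", "source": url}]


sample = fetch_plays(2019, range(1, 4), fake_fetch)
print(len(sample), sample[0]["week"])  # 3 1
```

With a real fetcher, the returned list can be passed straight to `pd.DataFrame`. The number of weeks to request is left to the caller, and if the API requires an API key, the corresponding request header would go inside the callable.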


### Hockey Data

[Hockey-Scraper](https://github.com/HarryShomer/Hockey-Scraper) and [Documentation](https://hockey-scraper.readthedocs.io/en/latest/)

### NBA Data

[NBA_API](https://github.com/swar/nba_api)

### Major League Baseball Data

[PyBaseball](https://github.com/jldbc/pybaseball)

### Soccer Data

1.  [StatsBomb](https://statsbomb.com/resource-centre/) The data is free, but you need to sign up to access it.

2.  [International Results from 1872-2019](https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017) Kaggle data set updated regularly

### NCAA Basketball Data

1. [Kaggle BigQuery Dataset](https://www.kaggle.com/ncaa/ncaa-basketball) Requires Knowledge of SQL
2. [Men's 2019 NCAA ML Kaggle Competition](https://www.kaggle.com/c/mens-machine-learning-competition-2019)
3. [Women's 2019 NCAA ML Kaggle Competition](https://www.kaggle.com/c/womens-machine-learning-competition-2019)

### Golf Data

[PGA Golf Tour Data](https://www.kaggle.com/bradklassen/pga-tour-20102018-data) Kaggle Dataset updated regularly