# Imports

In [1]:
import pandas as pd
from os import listdir
from os.path import join
from IPython.display import display, Markdown
raw_data_path = '../../data/raw/'

# Inventory of raw data
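This notebook takes an inventory of the CSV files in the `data/raw` folder. For each file it shows the `DataFrame.info()` summary, the first five rows, and brief comments on what the table contains.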


In [2]:
class DataInventory:
    """Steps through the CSV files in a directory, summarizing one file per call."""

    def __init__(self, directory_path):
        self.directory_path = directory_path
        self.csv_file_list = self.make_csv_file_list()

    def make_csv_file_list(self):
        """Returns a sorted list of all CSV files in the specified directory."""
        # Match on '.csv' (with the dot) so that names merely ending in 'csv' are skipped.
        return sorted(
            file for file in listdir(self.directory_path) if file.endswith('.csv')
        )

    def print_file_info(self, csv_file):
        """Displays a basic summary of the contents of the specified CSV file."""
        # Use this instance's directory rather than the global raw_data_path.
        file_path = join(self.directory_path, csv_file)
        df = pd.read_csv(file_path)
        display(Markdown('## ' + csv_file))
        display(Markdown('### Info'))
        # df.info() prints its summary and returns None, so display() also
        # renders a trailing 'None' after the summary.
        display(df.info())
        display(Markdown('### Head'))
        display(df.head())
        display(Markdown('### Comments'))

    def next_file(self):
        """Displays the summary of the next file in the list."""
        try:
            # pop(0) consumes the list, so each call advances to the next file.
            csv_file = self.csv_file_list.pop(0)
            self.print_file_info(csv_file)
        except IndexError:
            display(Markdown('# End of Data Inventory'))

inventory = DataInventory(raw_data_path)

In [3]:
inventory.next_file()

## Cities.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 453 entries, 0 to 452
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   CityID  453 non-null    int64 
 1   City    453 non-null    object
 2   State   453 non-null    object
dtypes: int64(1), object(2)
memory usage: 10.7+ KB


None

### Head

Unnamed: 0,CityID,City,State
0,4001,Abilene,TX
1,4002,Akron,OH
2,4003,Albany,NY
3,4004,Albuquerque,NM
4,4005,Allentown,PA


### Comments

This table provides city and state attributes for each `CityID`. It does not add any information about game play, but it will be useful for labeling visualizations.

In [4]:
inventory.next_file()

## Conferences.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ConfAbbrev   51 non-null     object
 1   Description  51 non-null     object
dtypes: object(2)
memory usage: 944.0+ bytes


None

### Head

Unnamed: 0,ConfAbbrev,Description
0,a_sun,Atlantic Sun Conference
1,a_ten,Atlantic 10 Conference
2,aac,American Athletic Conference
3,acc,Atlantic Coast Conference
4,aec,America East Conference


### Comments

This table provides descriptions of the conference abbreviations. It does not add any information about game play, but it will be useful for interpreting and labeling results and visualizations.

In [5]:
inventory.next_file()

## MConferenceTourneyGames.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5308 entries, 0 to 5307
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Season      5308 non-null   int64 
 1   ConfAbbrev  5308 non-null   object
 2   DayNum      5308 non-null   int64 
 3   WTeamID     5308 non-null   int64 
 4   LTeamID     5308 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 207.5+ KB


None

### Head

Unnamed: 0,Season,ConfAbbrev,DayNum,WTeamID,LTeamID
0,2001,a_sun,121,1194,1144
1,2001,a_sun,121,1416,1240
2,2001,a_sun,122,1209,1194
3,2001,a_sun,122,1359,1239
4,2001,a_sun,122,1391,1273


### Comments

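This table lists conference tournament games, one row per game, with the season, conference, day number, and the winning and losing team IDs. It does not include scores.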

**Investigate this table**
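
As a possible starting point for that investigation, the sketch below counts tournament games per season and conference. It uses only the five head rows shown above as toy data; the names `conf_games` and `games_per_conf` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy data: the five head rows shown above (the real file has 5,308 rows).
conf_games = pd.DataFrame({
    "Season": [2001, 2001, 2001, 2001, 2001],
    "ConfAbbrev": ["a_sun"] * 5,
    "DayNum": [121, 121, 122, 122, 122],
    "WTeamID": [1194, 1416, 1209, 1359, 1391],
    "LTeamID": [1144, 1240, 1194, 1239, 1273],
})

# Count the games recorded for each (season, conference) pair.
games_per_conf = (
    conf_games.groupby(["Season", "ConfAbbrev"])
    .size()
    .rename("Games")
    .reset_index()
)
print(games_per_conf)
```

On the full file, the same grouping would show which conferences and seasons are covered.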

In [6]:
inventory.next_file()

## MGameCities.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60166 entries, 0 to 60165
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   60166 non-null  int64 
 1   DayNum   60166 non-null  int64 
 2   WTeamID  60166 non-null  int64 
 3   LTeamID  60166 non-null  int64 
 4   CRType   60166 non-null  object
 5   CityID   60166 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 2.8+ MB


None

### Head

Unnamed: 0,Season,DayNum,WTeamID,LTeamID,CRType,CityID
0,2010,7,1143,1293,Regular,4027
1,2010,7,1314,1198,Regular,4061
2,2010,7,1326,1108,Regular,4080
3,2010,7,1393,1107,Regular,4340
4,2010,9,1143,1178,Regular,4027


### Comments

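This table identifies the city (`CityID`) in which each game was played, with games keyed by `Season`, `DayNum`, `WTeamID`, and `LTeamID`. `CityID` can be joined to `Cities.csv` to obtain city and state names.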

**Investigate CRType**
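
A first look at `CRType` could be as simple as tabulating its values. The sketch below does that on the five head rows shown above, so only `Regular` appears; which other values exist in the full file is not known from this inventory. The variable names are illustrative.

```python
import pandas as pd

# Toy data: the five head rows shown above (the real file has 60,166 rows).
game_cities = pd.DataFrame({
    "Season": [2010] * 5,
    "DayNum": [7, 7, 7, 7, 9],
    "WTeamID": [1143, 1314, 1326, 1393, 1143],
    "LTeamID": [1293, 1198, 1108, 1107, 1178],
    "CRType": ["Regular"] * 5,
    "CityID": [4027, 4061, 4080, 4340, 4027],
})

# How many games fall under each CRType value?
crtype_counts = game_cities["CRType"].value_counts()

# How many games were played in each city?
games_per_city = game_cities.groupby("CityID").size()

print(crtype_counts)
print(games_per_city)
```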

In [7]:
inventory.next_file()

## MMasseyOrdinals.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4120886 entries, 0 to 4120885
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   Season         int64 
 1   RankingDayNum  int64 
 2   SystemName     object
 3   TeamID         int64 
 4   OrdinalRank    int64 
dtypes: int64(4), object(1)
memory usage: 157.2+ MB


None

### Head

Unnamed: 0,Season,RankingDayNum,SystemName,TeamID,OrdinalRank
0,2003,35,SEL,1102,159
1,2003,35,SEL,1103,229
2,2003,35,SEL,1104,12
3,2003,35,SEL,1105,314
4,2003,35,SEL,1106,260


### Comments

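This table compiles ordinal team rankings from multiple ranking systems. Each row gives the rank (`OrdinalRank`) that one system (`SystemName`) assigned to one team on a given ranking day of a season.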
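
Because the table is in long format, comparing systems usually means reshaping it so that each system becomes a column. The following is a minimal sketch of that reshaping on the five head rows shown above, which contain a single system (`SEL`); the name `wide` is illustrative.

```python
import pandas as pd

# Toy data: the five head rows shown above (the real file has about 4.1 million rows).
ordinals = pd.DataFrame({
    "Season": [2003] * 5,
    "RankingDayNum": [35] * 5,
    "SystemName": ["SEL"] * 5,
    "TeamID": [1102, 1103, 1104, 1105, 1106],
    "OrdinalRank": [159, 229, 12, 314, 260],
})

# One row per (season, ranking day, team), one column per ranking system.
wide = ordinals.pivot_table(
    index=["Season", "RankingDayNum", "TeamID"],
    columns="SystemName",
    values="OrdinalRank",
    aggfunc="first",
)
print(wide)
```

On the full file the result would have one column per ranking system, with missing values wherever a system did not publish a ranking on that day.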

In [8]:
inventory.next_file()

## MNCAATourneyCompactResults.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2251 entries, 0 to 2250
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   2251 non-null   int64 
 1   DayNum   2251 non-null   int64 
 2   WTeamID  2251 non-null   int64 
 3   WScore   2251 non-null   int64 
 4   LTeamID  2251 non-null   int64 
 5   LScore   2251 non-null   int64 
 6   WLoc     2251 non-null   object
 7   NumOT    2251 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 140.8+ KB


None

### Head

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


### Comments

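This table records the outcome of each NCAA tournament game: the winning and losing teams, their scores, the winner's location (`WLoc`), and the number of overtime periods (`NumOT`).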
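
Each row stores a game from the winner's and loser's perspective at once. For per-team analysis it can help to reshape the table so that every game contributes one row per team. The sketch below shows one way to do this on the five head rows shown above; the column names `TeamID`, `OppID`, `Score`, `OppScore`, `Won`, and `Margin` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data: the five head rows shown above.
results = pd.DataFrame({
    "Season": [1985] * 5,
    "DayNum": [136] * 5,
    "WTeamID": [1116, 1120, 1207, 1229, 1242],
    "WScore": [63, 59, 68, 58, 49],
    "LTeamID": [1234, 1345, 1250, 1425, 1325],
    "LScore": [54, 58, 43, 55, 38],
})

# View each game from the winner's side ...
winners = results.rename(columns={
    "WTeamID": "TeamID", "WScore": "Score",
    "LTeamID": "OppID", "LScore": "OppScore",
}).assign(Won=1)

# ... and from the loser's side.
losers = results.rename(columns={
    "LTeamID": "TeamID", "LScore": "Score",
    "WTeamID": "OppID", "WScore": "OppScore",
}).assign(Won=0)

# Stack both views: two rows per game, aligned by column name.
team_games = pd.concat([winners, losers], ignore_index=True)
team_games["Margin"] = team_games["Score"] - team_games["OppScore"]
print(team_games)
```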

In [9]:
inventory.next_file()

## MNCAATourneyDetailedResults.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 34 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   1115 non-null   int64 
 1   DayNum   1115 non-null   int64 
 2   WTeamID  1115 non-null   int64 
 3   WScore   1115 non-null   int64 
 4   LTeamID  1115 non-null   int64 
 5   LScore   1115 non-null   int64 
 6   WLoc     1115 non-null   object
 7   NumOT    1115 non-null   int64 
 8   WFGM     1115 non-null   int64 
 9   WFGA     1115 non-null   int64 
 10  WFGM3    1115 non-null   int64 
 11  WFGA3    1115 non-null   int64 
 12  WFTM     1115 non-null   int64 
 13  WFTA     1115 non-null   int64 
 14  WOR      1115 non-null   int64 
 15  WDR      1115 non-null   int64 
 16  WAst     1115 non-null   int64 
 17  WTO      1115 non-null   int64 
 18  WStl     1115 non-null   int64 
 19  WBlk     1115 non-null   int64 
 20  WPF      1115 non-null   int64 
 21  LFGM     1115 non-null   int64 
 22  

None

### Head

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,134,1421,92,1411,84,N,1,32,69,...,31,14,31,17,28,16,15,5,0,22
1,2003,136,1112,80,1436,51,N,0,31,66,...,16,7,7,8,26,12,17,10,3,15
2,2003,136,1113,84,1272,71,N,0,31,59,...,28,14,21,20,22,11,12,2,5,18
3,2003,136,1141,79,1166,73,N,0,29,53,...,17,12,17,14,17,20,21,6,6,21
4,2003,136,1143,76,1301,74,N,1,27,64,...,21,15,20,10,26,16,14,5,8,19


### Comments

**This table requires substantial investigation**
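
One place to begin is with derived shooting statistics. The sketch below computes the winner's field goal percentage from `WFGM` and `WFGA`, using the values visible in the head rows above. It assumes, from the column names alone, that these columns hold field goals made and attempted; `WFGPct` is a name introduced here.

```python
import pandas as pd

# Toy data: the winner's field goal columns from the five head rows shown above.
detailed = pd.DataFrame({
    "Season": [2003] * 5,
    "WTeamID": [1421, 1112, 1113, 1141, 1143],
    "WFGM": [32, 31, 31, 29, 27],  # assumed: field goals made
    "WFGA": [69, 66, 59, 53, 64],  # assumed: field goals attempted
})

# Field goal percentage as a fraction between 0 and 1.
detailed["WFGPct"] = detailed["WFGM"] / detailed["WFGA"]
print(detailed)
```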

In [10]:
inventory.next_file()

## MNCAATourneySeedRoundSlots.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Seed         720 non-null    object
 1   GameRound    720 non-null    int64 
 2   GameSlot     720 non-null    object
 3   EarlyDayNum  720 non-null    int64 
 4   LateDayNum   720 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 28.2+ KB


None

### Head

Unnamed: 0,Seed,GameRound,GameSlot,EarlyDayNum,LateDayNum
0,W01,1,R1W1,136,137
1,W01,2,R2W1,138,139
2,W01,3,R3W1,143,144
3,W01,4,R4W1,145,146
4,W01,5,R5WX,152,152


### Comments

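This table appears to describe the general bracket structure of the tournament: for each seed, it lists the slot (`GameSlot`) that seed would occupy in each round (`GameRound`), along with the earliest and latest day numbers on which that game can fall.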

In [11]:
inventory.next_file()

## MNCAATourneySeeds.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2286 entries, 0 to 2285
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Season  2286 non-null   int64 
 1   Seed    2286 non-null   object
 2   TeamID  2286 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 53.7+ KB


None

### Head

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


### Comments

This table provides the tournament seed assigned to each team in each season. The rows shown start with the 1985 season.
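
The `Seed` strings in the head rows combine a region letter with a two-digit seed number, so a numeric seed has to be parsed out before it can be used as a feature. The sketch below is one way to do that with a regular expression. The optional trailing lowercase letter in the pattern is an assumption to tolerate suffixed seeds, which do not appear in the rows shown; `Region`, `SeedNum`, and `Suffix` are names introduced here.

```python
import pandas as pd

# Toy data: the five head rows shown above.
seeds = pd.DataFrame({
    "Season": [1985] * 5,
    "Seed": ["W01", "W02", "W03", "W04", "W05"],
    "TeamID": [1207, 1210, 1228, 1260, 1374],
})

# Split e.g. 'W01' into region 'W' and seed number 1.
parts = seeds["Seed"].str.extract(
    r"^(?P<Region>[A-Z])(?P<SeedNum>\d{2})(?P<Suffix>[a-z]?)$"
)
parts["SeedNum"] = parts["SeedNum"].astype(int)

seeds = seeds.join(parts)
print(seeds)
```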

In [12]:
inventory.next_file()

## MNCAATourneySlots.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2251 entries, 0 to 2250
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Season      2251 non-null   int64 
 1   Slot        2251 non-null   object
 2   StrongSeed  2251 non-null   object
 3   WeakSeed    2251 non-null   object
dtypes: int64(1), object(3)
memory usage: 70.5+ KB


None

### Head

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
0,1985,R1W1,W01,W16
1,1985,R1W2,W02,W15
2,1985,R1W3,W03,W14
3,1985,R1W4,W04,W13
4,1985,R1W5,W05,W12


### Comments

**Investigate further**

In [13]:
inventory.next_file()

## MRegularSeasonCompactResults.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166880 entries, 0 to 166879
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Season   166880 non-null  int64 
 1   DayNum   166880 non-null  int64 
 2   WTeamID  166880 non-null  int64 
 3   WScore   166880 non-null  int64 
 4   LTeamID  166880 non-null  int64 
 5   LScore   166880 non-null  int64 
 6   WLoc     166880 non-null  object
 7   NumOT    166880 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 10.2+ MB


None

### Head

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


### Comments

Summarizes the outcomes of regular season games.

In [14]:
inventory.next_file()

## MRegularSeasonDetailedResults.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92832 entries, 0 to 92831
Data columns (total 34 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   92832 non-null  int64 
 1   DayNum   92832 non-null  int64 
 2   WTeamID  92832 non-null  int64 
 3   WScore   92832 non-null  int64 
 4   LTeamID  92832 non-null  int64 
 5   LScore   92832 non-null  int64 
 6   WLoc     92832 non-null  object
 7   NumOT    92832 non-null  int64 
 8   WFGM     92832 non-null  int64 
 9   WFGA     92832 non-null  int64 
 10  WFGM3    92832 non-null  int64 
 11  WFGA3    92832 non-null  int64 
 12  WFTM     92832 non-null  int64 
 13  WFTA     92832 non-null  int64 
 14  WOR      92832 non-null  int64 
 15  WDR      92832 non-null  int64 
 16  WAst     92832 non-null  int64 
 17  WTO      92832 non-null  int64 
 18  WStl     92832 non-null  int64 
 19  WBlk     92832 non-null  int64 
 20  WPF      92832 non-null  int64 
 21  LFGM     92832 non-null  int64 
 22

None

### Head

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14


### Comments

Gives detailed information about regular season games.

In [15]:
inventory.next_file()

## MSampleSubmissionStage1.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11390 entries, 0 to 11389
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      11390 non-null  object 
 1   Pred    11390 non-null  float64
dtypes: float64(1), object(1)
memory usage: 178.1+ KB


None

### Head

Unnamed: 0,ID,Pred
0,2015_1107_1112,0.5
1,2015_1107_1116,0.5
2,2015_1107_1124,0.5
3,2015_1107_1125,0.5
4,2015_1107_1129,0.5


### Comments

Example file for Stage 1 submissions.
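
The `ID` values in the head rows look like a season and two team IDs joined by underscores. The sketch below splits them into separate integer columns, which is a likely first step when building predictions for each matchup. The column names `Season`, `TeamLow`, and `TeamHigh` are assumptions introduced here; the rows shown only suggest that the smaller ID comes first.

```python
import pandas as pd

# Toy data: the five head rows shown above.
submission = pd.DataFrame({
    "ID": ["2015_1107_1112", "2015_1107_1116", "2015_1107_1124",
           "2015_1107_1125", "2015_1107_1129"],
    "Pred": [0.5] * 5,
})

# Split 'season_team_team' into three integer columns.
parts = submission["ID"].str.split("_", expand=True).astype(int)
parts.columns = ["Season", "TeamLow", "TeamHigh"]

submission = pd.concat([submission, parts], axis=1)
print(submission)
```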

In [16]:
inventory.next_file()

## MSeasons.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   37 non-null     int64 
 1   DayZero  37 non-null     object
 2   RegionW  37 non-null     object
 3   RegionX  37 non-null     object
 4   RegionY  37 non-null     object
 5   RegionZ  37 non-null     object
dtypes: int64(1), object(5)
memory usage: 1.9+ KB


None

### Head

Unnamed: 0,Season,DayZero,RegionW,RegionX,RegionY,RegionZ
0,1985,10/29/1984,East,West,Midwest,Southeast
1,1986,10/28/1985,East,Midwest,Southeast,West
2,1987,10/27/1986,East,Southeast,Midwest,West
3,1988,11/2/1987,East,Midwest,Southeast,West
4,1989,10/31/1988,East,West,Midwest,Southeast


### Comments

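This table provides per-season configuration data: the calendar date of day zero (`DayZero`) and the names of the four tournament regions W, X, Y, and Z.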
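
`DayZero` is what links the `DayNum` columns in the game tables to calendar dates. The sketch below shows the conversion under the assumption that `DayNum` counts days elapsed since `DayZero`. The two `DayZero` values come from the head rows above, and the two games are the first 1985 rows of the regular season and NCAA tournament compact results.

```python
import pandas as pd

# Toy data taken from the head rows shown in this notebook.
seasons = pd.DataFrame({
    "Season": [1985, 1986],
    "DayZero": ["10/29/1984", "10/28/1985"],
})
games = pd.DataFrame({
    "Season": [1985, 1985],
    "DayNum": [20, 136],
})

# Parse the month/day/year strings into timestamps.
seasons["DayZero"] = pd.to_datetime(seasons["DayZero"], format="%m/%d/%Y")

# Attach each game's DayZero, then offset it by DayNum days.
games = games.merge(seasons[["Season", "DayZero"]], on="Season", how="left")
games["Date"] = games["DayZero"] + pd.to_timedelta(games["DayNum"], unit="D")
print(games)
```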

In [17]:
inventory.next_file()

## MSecondaryTourneyCompactResults.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1624 entries, 0 to 1623
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Season            1624 non-null   int64 
 1   DayNum            1624 non-null   int64 
 2   WTeamID           1624 non-null   int64 
 3   WScore            1624 non-null   int64 
 4   LTeamID           1624 non-null   int64 
 5   LScore            1624 non-null   int64 
 6   WLoc              1624 non-null   object
 7   NumOT             1624 non-null   int64 
 8   SecondaryTourney  1624 non-null   object
dtypes: int64(7), object(2)
memory usage: 114.3+ KB


None

### Head

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,SecondaryTourney
0,1985,136,1151,67,1155,65,H,0,NIT
1,1985,136,1153,77,1245,61,H,0,NIT
2,1985,136,1201,79,1365,76,H,0,NIT
3,1985,136,1231,79,1139,57,H,0,NIT
4,1985,136,1249,78,1222,71,H,0,NIT


### Comments

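**What is the Secondary Tournament?** The head rows show `NIT` (the National Invitation Tournament) in `SecondaryTourney`, so this table appears to cover postseason tournaments other than the NCAA tournament. The other values in that column still need to be checked.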

In [18]:
inventory.next_file()

## MSecondaryTourneyTeams.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1642 entries, 0 to 1641
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Season            1642 non-null   int64 
 1   SecondaryTourney  1642 non-null   object
 2   TeamID            1642 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 38.6+ KB


None

### Head

Unnamed: 0,Season,SecondaryTourney,TeamID
0,1985,NIT,1108
1,1985,NIT,1133
2,1985,NIT,1139
3,1985,NIT,1145
4,1985,NIT,1151


### Comments

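**What is the Secondary Tournament?** The head rows show `NIT` (the National Invitation Tournament) in `SecondaryTourney`, so this table appears to list the teams that took part in postseason tournaments other than the NCAA tournament.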

In [19]:
inventory.next_file()

## MTeamCoaches.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11704 entries, 0 to 11703
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Season       11704 non-null  int64 
 1   TeamID       11704 non-null  int64 
 2   FirstDayNum  11704 non-null  int64 
 3   LastDayNum   11704 non-null  int64 
 4   CoachName    11704 non-null  object
dtypes: int64(4), object(1)
memory usage: 457.3+ KB


None

### Head

Unnamed: 0,Season,TeamID,FirstDayNum,LastDayNum,CoachName
0,1985,1102,0,154,reggie_minton
1,1985,1103,0,154,bob_huggins
2,1985,1104,0,154,wimp_sanderson
3,1985,1106,0,154,james_oliver
4,1985,1108,0,154,davey_whitney


### Comments

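This table identifies the coach of each team in each season. `FirstDayNum` and `LastDayNum` give the span of days a coach was in charge, which would allow a team to have more than one coach within a season.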
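
Because a coaching spell is a day range rather than a single key, attaching a coach to a game takes a join followed by a range filter. The sketch below illustrates this for the winning team only. The coach rows come from the head above and the game is the 1985 regular season row won by team 1106; `on_day` and the other variable names are illustrative.

```python
import pandas as pd

# Toy data: the five head rows shown above.
coaches = pd.DataFrame({
    "Season": [1985] * 5,
    "TeamID": [1102, 1103, 1104, 1106, 1108],
    "FirstDayNum": [0] * 5,
    "LastDayNum": [154] * 5,
    "CoachName": ["reggie_minton", "bob_huggins", "wimp_sanderson",
                  "james_oliver", "davey_whitney"],
})

# One game from the regular season compact results head.
games = pd.DataFrame({"Season": [1985], "DayNum": [25], "WTeamID": [1106]})

# Join every coaching spell of the winning team in that season ...
merged = games.merge(
    coaches.rename(columns={"TeamID": "WTeamID"}),
    on=["Season", "WTeamID"],
    how="left",
)

# ... then keep only the spell that covers the game day.
covers_game = (merged["FirstDayNum"] <= merged["DayNum"]) & (
    merged["DayNum"] <= merged["LastDayNum"]
)
on_day = merged[covers_game]
print(on_day[["Season", "DayNum", "WTeamID", "CoachName"]])
```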

In [20]:
inventory.next_file()

## MTeamConferences.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11941 entries, 0 to 11940
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Season      11941 non-null  int64 
 1   TeamID      11941 non-null  int64 
 2   ConfAbbrev  11941 non-null  object
dtypes: int64(2), object(1)
memory usage: 280.0+ KB


None

### Head

Unnamed: 0,Season,TeamID,ConfAbbrev
0,1985,1102,wac
1,1985,1103,ovc
2,1985,1104,sec
3,1985,1106,swac
4,1985,1108,swac


### Comments

This table identifies the conference of each team in each season.

In [21]:
inventory.next_file()

## MTeamSpellings.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1147 entries, 0 to 1146
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   TeamNameSpelling  1147 non-null   object
 1   TeamID            1147 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 18.0+ KB


None

### Head

Unnamed: 0,TeamNameSpelling,TeamID
0,a&m-corpus chris,1394
1,a&m-corpus christi,1394
2,abilene chr,1101
3,abilene christian,1101
4,abilene-christian,1101


### Comments

Provides alternate spellings and abbreviations of team names.
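
This makes the table useful for mapping team names from external sources to a `TeamID`. The sketch below builds a lookup from the five head rows shown above. Lowercasing the input is an assumption based on the spellings in those rows all being lowercase, and `team_id_for` is a hypothetical helper.

```python
import pandas as pd

# Toy data: the five head rows shown above.
spellings = pd.DataFrame({
    "TeamNameSpelling": ["a&m-corpus chris", "a&m-corpus christi",
                         "abilene chr", "abilene christian",
                         "abilene-christian"],
    "TeamID": [1394, 1394, 1101, 1101, 1101],
})

# Series indexed by spelling, so a name can be looked up directly.
lookup = spellings.set_index("TeamNameSpelling")["TeamID"]


def team_id_for(name):
    """Return the TeamID for a team name, or None if the spelling is unknown."""
    return lookup.get(name.strip().lower())


print(team_id_for("Abilene Christian"))
print(team_id_for("Unknown State"))
```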

In [22]:
inventory.next_file()

## MTeams.csv

### Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371 entries, 0 to 370
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TeamID         371 non-null    int64 
 1   TeamName       371 non-null    object
 2   FirstD1Season  371 non-null    int64 
 3   LastD1Season   371 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 11.7+ KB


None

### Head

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2021
1,1102,Air Force,1985,2021
2,1103,Akron,1985,2021
3,1104,Alabama,1985,2021
4,1105,Alabama A&M,2000,2021


### Comments

This table provides the team name and the first and last D1 seasons for each `TeamID`.
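
The season range makes it possible to select the teams that were in D1 during a particular season. The sketch below filters the five head rows shown above for 1999, a season picked only for illustration, assuming both ends of the range are inclusive.

```python
import pandas as pd

# Toy data: the five head rows shown above.
teams = pd.DataFrame({
    "TeamID": [1101, 1102, 1103, 1104, 1105],
    "TeamName": ["Abilene Chr", "Air Force", "Akron", "Alabama", "Alabama A&M"],
    "FirstD1Season": [2014, 1985, 1985, 1985, 2000],
    "LastD1Season": [2021, 2021, 2021, 2021, 2021],
})

season = 1999  # illustrative choice

# Keep teams whose D1 range includes the chosen season (inclusive on both ends).
active = teams[
    (teams["FirstD1Season"] <= season) & (season <= teams["LastD1Season"])
]
print(active[["TeamID", "TeamName"]])
```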

In [23]:
inventory.next_file()

# End of Data Inventory