## Imports

In [4]:
import pandas as pd 

Dataset imports:

In [5]:
nba = pd.read_csv("./Datasets/nba_elo.csv")
pollution = pd.read_csv("./Datasets/pollution_2000_2023.csv", index_col=[0])
us_accidents = pd.read_csv("./Datasets/US_Accidents_March23.csv")
nba_teams = pd.read_csv("./Datasets/team.csv")

# Identification of the columns 

## NBA Games dataset

In [63]:
nba.head()

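| | index | date | season | neutral | playoff | team1 | team2 | elo1_pre | elo2_pre | elo_prob1 | ... | elo1_post | elo2_post | carmelo1_pre | carmelo2_pre | carmelo1_post | carmelo2_post | carmelo_prob1 | carmelo_prob2 | score1 | score2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1946-11-01 | 1947 | 0 | NaN | TRH | NYK | 1300.0 | 1300.0 | 0.640065 | ... | 1293.2767 | 1306.7233 | NaN | NaN | NaN | NaN | NaN | NaN | 66.0 | 68.0 |
| 1 | 1 | 1946-11-02 | 1947 | 0 | NaN | CHS | NYK | 1300.0 | 1306.7233 | 0.631101 | ... | 1309.6521 | 1297.0712 | NaN | NaN | NaN | NaN | NaN | NaN | 63.0 | 47.0 |
| 2 | 2 | 1946-11-02 | 1947 | 0 | NaN | PRO | BOS | 1300.0 | 1300.0 | 0.640065 | ... | 1305.1542 | 1294.8458 | NaN | NaN | NaN | NaN | NaN | NaN | 59.0 | 53.0 |
| 3 | 3 | 1946-11-02 | 1947 | 0 | NaN | STB | PIT | 1300.0 | 1300.0 | 0.640065 | ... | 1304.6908 | 1295.3092 | NaN | NaN | NaN | NaN | NaN | NaN | 56.0 | 51.0 |
| 4 | 4 | 1946-11-02 | 1947 | 0 | NaN | DTF | WSC | 1300.0 | 1300.0 | 0.640065 | ... | 1279.6189 | 1320.3811 | NaN | NaN | NaN | NaN | NaN | NaN | 33.0 | 50.0 |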


Dataset column names.

In [6]:
nba.columns

Index(['index', 'date', 'season', 'neutral', 'playoff', 'team1', 'team2',
       'elo1_pre', 'elo2_pre', 'elo_prob1', 'elo_prob2', 'elo1_post',
       'elo2_post', 'carmelo1_pre', 'carmelo2_pre', 'carmelo1_post',
       'carmelo2_post', 'carmelo_prob1', 'carmelo_prob2', 'score1', 'score2'],
      dtype='object')

The NBA dataset includes the following columns: 
* index: Generic index.
* date: Date of when the game took place.
* season: Season of the NBA when the game took place.
* neutral: Whether the game was played at a neutral site (1) or not (0).
* playoff: Playoff status of the matchup.
* team1: Abbreviation of one of the teams involved.
* team2: Abbreviation of the other team involved.
* elo1_pre: Elo rating prior to game for team1.
* elo2_pre: Elo rating prior to game for team2.
* elo_prob1: Probability of team1 winning based on its Elo rating.
* elo_prob2: Probability of team2 winning based on its Elo rating.
* elo1_post: Elo rating after game for team1.
* elo2_post: Elo rating after game for team2.
* carmelo1_pre: Carmelo rating prior to game for team1.
* carmelo2_pre: Carmelo rating prior to game for team2.
* carmelo1_post: Carmelo rating after the game for team1.
* carmelo2_post: Carmelo rating after the game for team2.
* carmelo_prob1: Probability of team1 winning based on its Carmelo rating.
* carmelo_prob2: Probability of team2 winning based on its Carmelo rating.
* score1: Score obtained by team1.
* score2: Score obtained by team2.

Of the columns above, the ones relevant to our project are:
* date
* season
* team1
* team2
* score1
* score2
* playoff

### Characterization of the columns

#### Date

In [7]:
pd.Series(nba["date"]).unique()

array(['1946-11-01', '1946-11-02', '1946-11-03', ..., '2018-04-26',
       '2018-04-27', '2018-04-28'], dtype=object)

All entries in this column are dates (strings) using the format YYYY-MM-DD.
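Since the dates are stored as strings, they usually need to be converted before any time-based filtering or joining. The sketch below shows one way to do that with `pd.to_datetime`; the `dates` Series is a small stand-in built from values shown in the output above, not the full column.

```python
import pandas as pd

# Stand-in for nba["date"], using a few of the values listed above.
dates = pd.Series(["1946-11-01", "1946-11-02", "2018-04-28"], name="date")

# Passing an explicit format makes any entry that is not YYYY-MM-DD raise an error
# instead of being silently reinterpreted.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

print(parsed.dt.year.tolist())
```

On the real dataset the same call would be applied to `nba["date"]` directly.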

#### Season

In [8]:
pd.Series(nba["season"]).unique()

array([1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957,
       1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
       1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
       1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
       1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
       2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014, 2015, 2016, 2017, 2018])

All entries in this column identify the NBA season in which a game took place. A season is labelled by the year in which it ends: the game played on 1946-11-01 belongs to season 1947. The seasons present range from 1947 to 2018.

#### Neutral

In [9]:
pd.Series(nba["neutral"]).unique()

array([0, 1])

All entries in this column take only values 0 and 1.
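The same inspection is repeated for every column in this notebook, so it can be wrapped in a small helper. The sketch below is only an illustration: `summarize_column` is a hypothetical name, and `sample` is a stand-in frame, not the real dataset. Note that `df[column]` is already a Series, so the `pd.Series(...)` wrapper used in the cells is not strictly needed.

```python
import pandas as pd

def summarize_column(df: pd.DataFrame, column: str) -> dict:
    """Return the dtype, number of distinct non-missing values and missing count."""
    series = df[column]  # already a Series, no pd.Series(...) wrapper required
    return {
        "dtype": str(series.dtype),
        "n_unique": int(series.nunique(dropna=True)),
        "n_missing": int(series.isna().sum()),
    }

# Stand-in data with the kinds of values seen in the "neutral" and "playoff" columns.
sample = pd.DataFrame({"neutral": [0, 1, 0, 0], "playoff": [None, "t", "q", None]})

for name in sample.columns:
    print(name, summarize_column(sample, name))
```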

#### Playoff  


In [10]:
pd.Series(nba["playoff"]).unique()

array([nan, 't', 'q', 's', 'c', 'f'], dtype=object)

Entries in this column are one of 't', 'q', 's', 'c' or 'f', or are missing (NaN).
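If a missing value means that the game was not a playoff game (an assumption, since the dataset does not document it here), the column can be reduced to a boolean flag. A minimal sketch on a stand-in Series:

```python
import pandas as pd

# Stand-in for nba["playoff"] containing every value observed above.
playoff = pd.Series([None, "t", "q", "s", "c", "f"], name="playoff")

# Assumption: a missing code marks a non-playoff game.
is_playoff = playoff.notna()

print(is_playoff.tolist())
```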

#### Team1  

In [11]:
pd.Series(nba["team1"]).unique()

array(['TRH', 'CHS', 'PRO', 'STB', 'DTF', 'CLR', 'PIT', 'BOS', 'PHW',
       'NYK', 'WSC', 'BLB', 'INJ', 'FTW', 'ROC', 'MNL', 'TRI', 'INO',
       'WAT', 'AND', 'SHE', 'SYR', 'DNN', 'MLH', 'STL', 'DET', 'CIN',
       'LAL', 'CHP', 'CHZ', 'SFW', 'BAL', 'PHI', 'CHI', 'OAK', 'SDR',
       'INA', 'DNR', 'DLC', 'NOB', 'SEA', 'KEN', 'MNM', 'HSM', 'NJA',
       'PTP', 'ANA', 'ATL', 'MIL', 'PHO', 'NYA', 'MNP', 'LAS', 'MMF',
       'CAR', 'WSA', 'BUF', 'UTS', 'FLO', 'PTC', 'POR', 'VIR', 'MMP',
       'TEX', 'CLE', 'HOU', 'GSW', 'KCO', 'MMT', 'SDA', 'SAA', 'CAP',
       'SSL', 'WSB', 'DNA', 'MMS', 'NOJ', 'SDS', 'KCK', 'IND', 'DEN',
       'NYN', 'SAS', 'NJN', 'SDC', 'UTA', 'DAL', 'LAC', 'SAC', 'CHH',
       'MIA', 'ORL', 'MIN', 'TOR', 'VAN', 'WAS', 'MEM', 'NOP', 'CHO',
       'NOK', 'OKC', 'BRK'], dtype=object)

In [12]:
len(pd.Series(nba["team1"]).unique())

102

All entries in this column are three-character strings, each the abbreviation of a team's name. There are 102 distinct abbreviations.

#### Team2  

In [13]:
pd.Series(nba["team2"]).unique()

array(['NYK', 'BOS', 'PIT', 'WSC', 'TRH', 'CHS', 'STB', 'DTF', 'PRO',
       'CLR', 'PHW', 'BLB', 'MNL', 'INJ', 'FTW', 'ROC', 'DNN', 'SHE',
       'TRI', 'SYR', 'INO', 'WAT', 'AND', 'MLH', 'STL', 'DET', 'CIN',
       'LAL', 'CHP', 'CHZ', 'SFW', 'PHI', 'BAL', 'CHI', 'SEA', 'ANA',
       'KEN', 'HSM', 'DNR', 'SDR', 'INA', 'DLC', 'PTP', 'MNM', 'NJA',
       'NOB', 'OAK', 'MIL', 'PHO', 'ATL', 'MMF', 'LAS', 'NYA', 'MNP',
       'WSA', 'CAR', 'CLE', 'FLO', 'PTC', 'TEX', 'VIR', 'POR', 'MMP',
       'BUF', 'UTS', 'GSW', 'HOU', 'MMT', 'KCO', 'SDA', 'CAP', 'SAA',
       'NOJ', 'DNA', 'MMS', 'WSB', 'SSL', 'KCK', 'SDS', 'NYN', 'DEN',
       'SAS', 'IND', 'NJN', 'SDC', 'UTA', 'DAL', 'LAC', 'SAC', 'CHH',
       'MIA', 'MIN', 'ORL', 'VAN', 'TOR', 'WAS', 'MEM', 'NOP', 'CHO',
       'NOK', 'OKC', 'BRK'], dtype=object)

All entries in this column are three-character strings, each the abbreviation of a team's name.

#### Elo1 Pre   

In [14]:
pd.Series(nba["elo1_pre"]).unique()

array([1300.        , 1295.3092    , 1294.8458    , ..., 1514.04944359,
       1551.42161395, 1686.33918606])

All entries in this column take real numerical values.

#### Elo2 Pre  

In [15]:
pd.Series(nba["elo2_pre"]).unique()

array([1300.        , 1306.7233    , 1293.2767    , ..., 1575.48072491,
       1732.21374124, 1619.55936389])

All entries in this column take real numerical values.

#### Elo Probability 1  

In [16]:
pd.Series(nba["elo_prob1"]).unique()

array([0.640065  , 0.63110119, 0.64893216, ..., 0.55527797, 0.38578214,
       0.72313418])

All entries in this column take real numerical values.

#### Elo Probability 2

In [17]:
pd.Series(nba["elo_prob2"]).unique()

All entries in this column take real numerical values.

#### Elo1 Post 

In [18]:
pd.Series(nba["elo1_post"]).unique()

array([1293.2767    , 1309.6521    , 1305.1542    , ..., 1614.26304141,
       1506.10923004,           nan])

All entries in this column take real numerical values or nan.

#### Elo2 Post  

In [19]:
pd.Series(nba["elo2_post"]).unique()

array([1306.7233    , 1297.0712    , 1294.8458    , ..., 1686.33918606,
       1486.32402581,           nan])

All entries in this column take real numerical values or nan.

#### Carmelo1 Pre  

In [20]:
pd.Series(nba["carmelo1_pre"]).unique()

array([          nan, 1542.664875  , 1564.372491  , ..., 1507.97669924,
       1564.62062015, 1677.6378942 ])

All entries in this column take real numerical values or nan.

#### Carmelo2 Pre  

In [21]:
pd.Series(nba["carmelo2_pre"]).unique()

array([          nan, 1451.827385  , 1732.025482  , ..., 1578.31787597,
       1748.38976757, 1632.27704616])

All entries in this column take real numerical values or nan.

#### Carmelo1 Post  

In [22]:
pd.Series(nba["carmelo1_post"]).unique()

array([          nan, 1521.80999498, 1570.47393668, ..., 1355.61264748,
       1618.40111052, 1508.01099716])

All entries in this column take real numerical values or nan.

#### Carmelo2 Post  

In [23]:
pd.Series(nba["carmelo2_post"]).unique()

array([          nan, 1472.68226502, 1725.92403632, ..., 1748.38976757,
       1677.6378942 , 1488.9283395 ])

All entries in this column take real numerical values or nan.

#### Carmelo Probability 1  

In [24]:
pd.Series(nba["carmelo_prob1"]).unique()

array([       nan, 0.75278039, 0.40307779, ..., 0.52796995, 0.35143533,
       0.68591079])

All entries in this column take real numerical values or nan.

#### Carmelo Probability 2  

In [25]:
pd.Series(nba["carmelo_prob2"]).unique()

array([       nan, 0.24721961, 0.59692221, ..., 0.47203005, 0.64856467,
       0.31408921])

All entries in this column take real numerical values or nan.
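The two probability columns should be complementary, and this can be checked rather than assumed. The sketch below uses a stand-in frame built from the first values in the two outputs above; pairing them row by row is an assumption based on their order of appearance.

```python
import numpy as np
import pandas as pd

# Stand-in rows: values taken from the unique() outputs above, paired by position.
sample = pd.DataFrame({
    "carmelo_prob1": [np.nan, 0.75278039, 0.40307779],
    "carmelo_prob2": [np.nan, 0.24721961, 0.59692221],
})

# Rows without a Carmelo rating are NaN in both columns and are dropped first.
total = (sample["carmelo_prob1"] + sample["carmelo_prob2"]).dropna()

# isclose tolerates floating-point rounding in the stored probabilities.
all_complementary = bool(np.isclose(total, 1.0).all())
print(all_complementary)
```

The same check could be run on `elo_prob1` and `elo_prob2`.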

#### Score1  

In [26]:
pd.Series(nba["score1"]).unique()

array([ 66.,  63.,  59.,  56.,  33.,  71.,  55.,  49.,  81.,  73.,  76.,
        70.,  62.,  69.,  92.,  68.,  84.,  64.,  85.,  53.,  65.,  83.,
        61.,  54.,  72.,  58.,  82.,  74.,  46.,  67.,  75.,  60.,  79.,
        44.,  80.,  86.,  47.,  88.,  52.,  77.,  78.,  57.,  95.,  48.,
        90.,  50.,  91.,  45.,  87.,  97.,  93.,  89., 109.,  51.,  98.,
        99., 107.,  96., 105., 106., 108., 100.,  94., 103., 117., 104.,
       102., 101., 114., 110., 121., 120., 125., 112., 115., 111., 141.,
       118.,  18., 119., 113., 124., 126., 131., 128., 116., 122., 133.,
       130., 123., 134., 135., 142., 140., 129., 136., 146., 127., 132.,
       138., 139., 137., 173., 151., 143., 152., 147., 150., 144., 155.,
       153., 149., 148., 162., 145., 163., 154., 158., 172., 169., 156.,
       159., 165., 160., 177., 157.,   2., 176., 171., 184., 161., 168.,
        nan])

All entries in this column are whole numbers or NaN. Because of the missing entries, pandas stores the column as floats.

#### Score2 

In [27]:
pd.Series(nba["score2"]).unique()

array([ 68.,  47.,  53.,  51.,  50.,  60.,  71.,  57.,  75.,  65.,  66.,
        69.,  54.,  46.,  78.,  61.,  70.,  62.,  49.,  73.,  74.,  48.,
        82.,  76.,  67.,  55.,  87.,  40.,  56.,  90.,  64.,  44.,  85.,
        79.,  88.,  52.,  77.,  92.,  63., 103.,  72.,  58.,  59.,  83.,
        81., 104.,  38.,  84.,  43.,  96.,  80., 107., 101.,  91.,  86.,
        89.,  95.,  99., 114.,  98.,  94.,  97.,  93., 100., 110., 109.,
       115., 102., 106., 113., 123., 111., 105., 116., 108.,  19., 112.,
       118., 117., 129., 119., 122., 120., 121., 135., 126., 127., 137.,
       143., 125., 131., 124., 138., 136., 130., 132., 145., 133., 128.,
       139., 134., 141., 140., 144., 142., 148., 151., 153., 149., 147.,
       169., 162., 160., 156., 150., 155., 157., 146., 154.,   0., 166.,
       186., 152., 161.,  nan])

All entries in this column are whole numbers or NaN. Because of the missing entries, pandas stores the column as floats.
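If integer scores are preferred, the pandas nullable integer dtype `Int64` keeps the missing entries while dropping the float representation. A minimal sketch on a stand-in Series:

```python
import numpy as np
import pandas as pd

# Stand-in for nba["score1"]: two observed scores and one missing value.
scores = pd.Series([66.0, 63.0, np.nan], name="score1")

# Confirm every non-missing score is a whole number before converting.
whole_numbers_only = bool((scores.dropna() % 1 == 0).all())

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
scores_int = scores.astype("Int64")

print(whole_numbers_only, scores_int.dtype)
```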

## Air Pollution in the US dataset

First rows of the dataset, followed by its column names.

In [62]:
pollution.head()

| | Date | Address | State | County | City | O3 Mean | O3 1st Max Value | O3 1st Max Hour | O3 AQI | CO Mean | ... | CO 1st Max Hour | CO AQI | SO2 Mean | SO2 1st Max Value | SO2 1st Max Hour | SO2 AQI | NO2 Mean | NO2 1st Max Value | NO2 1st Max Hour | NO2 AQI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 0.019765 | 0.04 | 10 | 37 | 0.878947 | ... | 23 | 25.0 | 3.0 | 9.0 | 21 | 13.0 | 19.041667 | 49.0 | 19 | 46 |
| 1 | 2000-01-02 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 0.015882 | 0.032 | 10 | 30 | 1.066667 | ... | 0 | 26.0 | 1.958333 | 3.0 | 22 | 4.0 | 22.958333 | 36.0 | 19 | 34 |
| 2 | 2000-01-03 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 0.009353 | 0.016 | 9 | 15 | 1.7625 | ... | 8 | 28.0 | 5.25 | 11.0 | 19 | 16.0 | 38.125 | 51.0 | 8 | 48 |
| 3 | 2000-01-04 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 0.015882 | 0.033 | 9 | 31 | 1.829167 | ... | 23 | 34.0 | 7.083333 | 16.0 | 8 | 23.0 | 40.26087 | 74.0 | 8 | 72 |
| 4 | 2000-01-05 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 0.007353 | 0.012 | 9 | 11 | 2.7 | ... | 2 | 42.0 | 8.708333 | 15.0 | 7 | 21.0 | 48.45 | 61.0 | 22 | 58 |


In [28]:
pollution.columns

Index(['Date', 'Address', 'State', 'County', 'City', 'O3 Mean',
       'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'CO Mean',
       'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'SO2 Mean',
       'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'NO2 Mean',
       'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI'],
      dtype='object')

In [29]:
len(pollution.columns)

21

The Air Pollution in the US dataset includes the following columns: 
* Date
* Address
* State
* County
* City
* O3 Mean
* O3 1st Max Value
* O3 1st Max Hour
* O3 AQI
* CO Mean
* CO 1st Max Value
* CO 1st Max Hour
* CO AQI
* SO2 Mean
* SO2 1st Max Value
* SO2 1st Max Hour
* SO2 AQI
* NO2 Mean
* NO2 1st Max Value
* NO2 1st Max Hour
* NO2 AQI

Of the columns above, the ones relevant to our project are:
* Date
* State
* County
* City
* O3 AQI
* CO AQI
* SO2 AQI
* NO2 AQI

### Characterization of the necessary columns

#### Date

In [30]:
pd.Series(pollution["Date"]).unique()

array(['2000-01-01', '2000-01-02', '2000-01-03', ..., '2023-09-28',
       '2023-09-29', '2023-09-30'], dtype=object)

All entries in this column are dates (strings) using the format YYYY-MM-DD.

#### State

In [31]:
pd.Series(pollution["State"]).unique()

array(['Arizona', 'California', 'Colorado', 'District Of Columbia',
       'Florida', 'Illinois', 'Indiana', 'Kansas', 'Kentucky',
       'Louisiana', 'Michigan', 'Missouri', 'New Jersey', 'New York',
       'North Carolina', 'Oklahoma', 'Pennsylvania', 'Texas', 'Virginia',
       'Washington', 'Wisconsin', 'Massachusetts', 'Nevada',
       'New Hampshire', 'Tennessee', 'South Carolina', 'Connecticut',
       'Iowa', 'Maine', 'Maryland', 'Arkansas', 'Oregon', 'Wyoming',
       'North Dakota', 'Idaho', 'Ohio', 'Georgia', 'Delaware', 'Hawaii',
       'Minnesota', 'New Mexico', 'Rhode Island', 'South Dakota', 'Utah',
       'Alabama', 'Alaska', 'Mississippi', 'Vermont'], dtype=object)

#### County

In [32]:
pd.Series(pollution["County"]).unique()

array(['Maricopa', 'Pima', 'Contra Costa', 'Imperial', 'Kern',
       'Los Angeles', 'Orange', 'Riverside', 'Sacramento',
       'San Bernardino', 'San Diego', 'San Francisco', 'Santa Barbara',
       'Santa Cruz', 'Solano', 'Ventura', 'Adams', 'District of Columbia',
       'Cook', 'Marion', 'Linn', 'Sumner', 'Wyandotte', 'Boyd', 'Daviess',
       'Fayette', 'Henderson', 'Jefferson', 'McCracken',
       'East Baton Rouge', 'Wayne', 'Saint Louis', 'St. Louis City',
       'Camden', 'Bronx', 'Queens', 'Suffolk', 'Forsyth', 'Mecklenburg',
       'Cherokee', 'Kay', 'Allegheny', 'Beaver', 'Berks', 'Blair',
       'Bucks', 'Cambria', 'Lackawanna', 'Lancaster', 'Lawrence',
       'Montgomery', 'Northampton', 'Philadelphia', 'Washington',
       'Westmoreland', 'York', 'Dallas', 'El Paso', 'Harris', 'Fairfax',
       'Alexandria City', 'King', 'Milwaukee', 'Alameda', 'Saint Clair',
       'Clark', 'Essex', 'Ottawa', 'Kent', 'Hillsborough', 'Haywood',
       'Meigs', 'Fresno', 'Charleston', 'D

#### City 

In [33]:
pd.Series(pollution["City"]).unique()

array(['Phoenix', 'Scottsdale', 'Tucson', 'Concord', 'Bethel Island',
       'San Pablo', 'Pittsburg', 'Calexico', 'Bakersfield', 'Burbank',
       'Los Angeles', 'Long Beach', 'Hawthorne', 'Costa Mesa', 'Rubidoux',
       'Not in a city', 'Arden-Arcade', 'Victorville', 'Chula Vista',
       'San Diego', 'San Francisco', 'Capitan', 'Lompoc', 'Goleta',
       'Vandenberg Air Force Base', 'Davenport', 'Vallejo', 'Welby',
       'Washington', 'Winter Park', 'Chicago', 'Cicero',
       'Calumet City (PU RR name Calumet Park (sta.))',
       'Indianapolis (Remainder)', 'Kansas City', 'Ashland',
       'Lexington-Fayette (corporate name for Lexington)', 'Henderson',
       'Louisville', 'Paducah', 'Baton Rouge', 'Detroit', 'Sunset Hills',
       'Ladue', 'Ferguson', 'St. Ann', 'St. Louis', 'Camden', 'New York',
       'Holtsville', 'Winston-Salem', 'Charlotte', 'Park Hill',
       'Ponca City', 'Pittsburgh', 'Beaver Falls', 'Reading', 'Altoona',
       'Bristol', 'Johnstown', 'Scranton', 'La

#### O3 AQI

In [34]:
pd.Series(pollution["O3 AQI"]).unique()

array([ 37,  30,  15,  31,  11,  23,  22,  19,  20,  14,  16,  13,   9,
        17,  21,  12,  33,  32,  26,  25,  34,  28,  36,  40,  18,  27,
        38,  29,  41,  39,  35,  43,  44,  50,  49,  47,  45,  46,  64,
        42,  84,  74,  58,  61,  67,  51,  48,  71,  80,  54, 101, 108,
       105,  90,  87, 112, 119, 100,  97,  93,  24,  77, 136, 156, 115,
         7,   6,  10, 122, 126,   8, 129,   5, 172,   2,   4, 150, 140,
         3, 133, 143, 208, 182, 179, 214, 159, 174, 197,   0,   1, 164,
       195, 161, 151, 201, 207, 202, 154, 177, 200, 147, 169, 166, 187,
       190, 205, 216, 213, 204, 222, 185, 209, 227, 215, 228, 212, 217,
       210, 192, 203, 206, 221, 220, 224, 218, 219, 237, 226, 225, 223])

All values in this column are integers.

#### CO AQI

In [35]:
pd.Series(pollution["CO AQI"]).unique()

array([ 25.,  26.,  28.,  34.,  42.,  41.,  40.,  57.,  32.,  51.,  48.,
        52.,  59.,  54.,  44.,  55.,  47.,  43.,  45.,  38.,  33.,  16.,
        18.,  22.,  35.,  27.,  30.,  36.,  56.,  23.,  19.,  11.,  39.,
         6.,  10.,  13.,  24.,  15.,  20.,   7.,   9.,   8.,  17.,  31.,
        14.,   5.,   3.,   2.,   1.,  49.,   0.,  92., 126.,  61., 103.,
        58.,  50.,  73.,  60.,  96.,  64.,  69.,  67.,  65.,  68.,  70.,
        63.,  62., 136., 159.,  78.,  81.,  91., 183.,  79., 201., 173.,
        80.,  53.,  66.,  72.,  77.,  82., 106., 118.,  86.,  95., 138.,
        85.,  74.,  76.,  98.,  94.,  93.,  84., 113.,  75.,  89., 115.,
        87.,  71.])

All values in this column are whole numbers, although the column is stored as floats.

#### SO2 AQI

In [36]:
pd.Series(pollution["SO2 AQI"]).unique()

array([ 13.,   4.,  16.,  23.,  21.,  24.,  30.,  26.,  19.,  34.,  37.,
        41.,  31.,  27.,  33.,  20.,  10.,  14.,  29.,  40.,  17.,   9.,
        11.,   3.,   7.,   6.,   0.,   1.,  39.,  36., 100.,  64.,  54.,
        44.,  46.,  65.,  49., 141., 115.,  75.,  43.,  52.,  55.,  62.,
        56.,  59.,  79.,  47.,  71.,  50.,  51.,  61.,  69.,  60.,  76.,
        70.,  89.,  66.,  82.,  91.,  86.,  92., 104.,  81.,  87.,  67.,
       102.,  57., 105.,  84., 136., 133., 106., 124.,  74.,  72., 132.,
        96., 110., 101.,  80.,  97., 112.,  90., 108.,  77.,  95.,  85.,
       117., 113., 107., 128.,  94.,  99., 111., 134., 119., 109., 144.,
       103., 116., 120., 135., 172., 118., 122., 114., 129., 127., 126.,
       150., 173., 155., 177., 180., 168., 123., 137., 161., 130., 121.,
       125., 195., 167., 153., 145., 156., 200., 140., 176., 154., 131.,
       139., 147., 138., 169., 159., 163., 146., 148., 152., 142.])

All values in this column are whole numbers, although the column is stored as floats.

#### NO2 AQI

In [37]:
pd.Series(pollution["NO2 AQI"]).unique()

array([ 46,  34,  48,  72,  58,  71,  41,  39,  35,  68,  80, 104, 105,
        86, 101,  63,  62,  64,  76,  56,  40,  51,  54,  50,  52,  69,
        91,  65,  88,  45,  67,  61,  37,  66,  44,  31,  57,  55,  47,
        42,  73,  77,  15,  11,  53,  74,   8,   7,   5,   4,   3,   1,
        10,   6,  13,  12,   9, 102,  79,  83,  60,  26,  23,  27,  29,
        33,  24,  43,  32,  21,  38,  25,  28,  49,  36,  78,  70,  30,
        85, 114, 113, 115, 120, 111, 116, 132, 127, 124, 130, 118, 126,
       121, 128, 123, 119, 110, 107, 109, 108, 117, 125, 112,  82, 103,
        97, 106,  19,  90,  16,  22,  14,  17,  20,  18,   2,  84,  87,
        96,  81, 100,  94,  95,  99,  89,  93,  98,   0, 129, 131, 133,
       152, 153, 151, 138])

All values in this column are integers.
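The raw AQI numbers are easier to read once grouped into categories. The sketch below bins a few of the O3 AQI values listed above with `pd.cut`. The breakpoints and labels are the standard US EPA AQI categories and are an assumption here, since the dataset itself does not define them.

```python
import pandas as pd

# Stand-in for pollution["O3 AQI"], using values that appear in the output above.
aqi = pd.Series([37, 101, 156, 208, 0], name="O3 AQI")

# Assumed EPA AQI category boundaries (upper edges are inclusive).
bins = [0, 50, 100, 150, 200, 300, 500]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# include_lowest=True keeps an AQI of exactly 0 in the first bin.
category = pd.cut(aqi, bins=bins, labels=labels, include_lowest=True)

print(category.astype(str).tolist())
```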

## US Traffic Accidents dataset

In [64]:
us_accidents.head()

| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | Source2 | 3 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.865147 | -84.058723 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 1 | A-2 | Source2 | 2 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.928059 | -82.831184 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Day |
| 2 | A-3 | Source2 | 2 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.063148 | -84.032608 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Night | Night | Day | Day |
| 3 | A-4 | Source2 | 3 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.747753 | -84.205582 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Day | Day | Day |
| 4 | A-5 | Source2 | 2 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.627781 | -84.188354 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Day | Day | Day | Day |


Dataset column names.

In [38]:
us_accidents.columns

Index(['ID', 'Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat',
       'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
       'Street', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
       'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
       'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
       'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
       'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')

In [39]:
len(us_accidents.columns)

46

The US Traffic Accidents dataset includes the following columns:
* ID
* Source
* Severity
* Start_Time
* End_Time
* Start_Lat
* Start_Lng
* End_Lat
* End_Lng
* Distance(mi)
* Description
* Street
* City
* County
* State
* Zipcode
* Country
* Timezone
* Airport_Code
* Weather_Timestamp
* Temperature(F)
* Wind_Chill(F)
* Humidity(%)
* Pressure(in)
* Visibility(mi)
* Wind_Direction
* Wind_Speed(mph)
* Precipitation(in)
* Weather_Condition
* Amenity
* Bump
* Crossing
* Give_Way
* Junction
* No_Exit
* Railway
* Roundabout
* Station
* Stop
* Traffic_Calming
* Traffic_Signal
* Turning_Loop
* Sunrise_Sunset
* Civil_Twilight
* Nautical_Twilight
* Astronomical_Twilight

Of the columns above, the ones relevant to our project are:
* Severity
* Start_Time
* City
* State
* Amenity
* Bump
* Crossing
* Give_Way
* Junction
* No_Exit
* Railway
* Roundabout
* Station
* Stop
* Traffic_Calming
* Traffic_Signal
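With 46 columns in the file and only 16 of interest, the relevant ones can be selected at load time through the `usecols` parameter of `pd.read_csv`. The sketch below reads from a tiny in-memory CSV instead of the real file; the row values for the hidden flag columns are illustrative only.

```python
import io
import pandas as pd

relevant_columns = [
    "Severity", "Start_Time", "City", "State", "Amenity", "Bump", "Crossing",
    "Give_Way", "Junction", "No_Exit", "Railway", "Roundabout", "Station",
    "Stop", "Traffic_Calming", "Traffic_Signal",
]

# Illustrative one-row stand-in for the accidents CSV, with two extra columns
# ("ID" and "Source") that should be skipped on load.
row = {"ID": "A-1", "Source": "Source2", "Severity": 3,
       "Start_Time": "2016-02-08 05:46:00", "City": "Dayton", "State": "OH"}
row.update({flag: False for flag in relevant_columns[4:]})
csv_file = io.StringIO(pd.DataFrame([row]).to_csv(index=False))

# usecols restricts parsing to the listed columns; file order is preserved.
accidents = pd.read_csv(csv_file, usecols=relevant_columns)

print(accidents.shape)
```

On the real data, `csv_file` would be replaced by the path used in the import cell.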

### Characterization of the necessary columns

#### Severity

In [40]:
pd.Series(us_accidents["Severity"]).unique()

array([3, 2, 1, 4])

#### Start_Time

In [41]:
pd.Series(us_accidents["Start_Time"]).unique()

array(['2016-02-08 05:46:00', '2016-02-08 06:07:59',
       '2016-02-08 06:49:27', ..., '2019-08-23 19:11:30',
       '2019-08-23 19:00:21', '2019-08-23 18:52:06'], dtype=object)

#### City

In [42]:
pd.Series(us_accidents["City"]).unique()

array(['Dayton', 'Reynoldsburg', 'Williamsburg', ..., 'Ness City',
       'Clarksdale', 'American Fork-Pleasant Grove'], dtype=object)

#### State

In [43]:
pd.Series(us_accidents["State"]).unique()

array(['OH', 'WV', 'CA', 'FL', 'GA', 'SC', 'NE', 'IA', 'IL', 'MO', 'WI',
       'IN', 'MI', 'NJ', 'NY', 'CT', 'MA', 'RI', 'NH', 'PA', 'KY', 'MD',
       'VA', 'DC', 'DE', 'TX', 'WA', 'OR', 'AL', 'NC', 'AZ', 'TN', 'LA',
       'MN', 'CO', 'OK', 'NV', 'UT', 'KS', 'NM', 'AR', 'MS', 'ME', 'VT',
       'WY', 'ID', 'ND', 'MT', 'SD'], dtype=object)
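This column uses two-letter codes, while the pollution dataset uses full state names, so the two cannot be matched without a mapping. The sketch below shows the idea with a deliberately partial, hypothetical `STATE_NAMES` dictionary; a real mapping would need an entry for every code.

```python
import pandas as pd

# Partial mapping for illustration only. The spelling "District Of Columbia"
# follows the pollution dataset.
STATE_NAMES = {"OH": "Ohio", "CA": "California", "DC": "District Of Columbia"}

# Stand-in for us_accidents["State"].
accident_states = pd.Series(["OH", "CA", "DC", "WV"], name="State")

# Codes missing from the mapping become NaN, which makes gaps easy to list.
full_names = accident_states.map(STATE_NAMES)
unmapped = accident_states[full_names.isna()].unique().tolist()

print(full_names.tolist())
print(unmapped)
```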

#### Amenity

In [45]:
pd.Series(us_accidents["Amenity"]).unique()

array([False,  True])

#### Bump

In [46]:
pd.Series(us_accidents["Bump"]).unique()

array([False,  True])

#### Crossing

In [47]:
pd.Series(us_accidents["Crossing"]).unique()

array([False,  True])

#### Give_Way

In [48]:
pd.Series(us_accidents["Give_Way"]).unique()

array([False,  True])

#### Junction 

In [49]:
pd.Series(us_accidents["Junction"]).unique()

array([False,  True])

#### No_Exit

In [50]:
pd.Series(us_accidents["No_Exit"]).unique()

array([False,  True])

#### Railway 

In [51]:
pd.Series(us_accidents["Railway"]).unique()

array([False,  True])

#### Roundabout 

In [52]:
pd.Series(us_accidents["Roundabout"]).unique()

array([False,  True])

#### Station

In [53]:
pd.Series(us_accidents["Station"]).unique()

array([False,  True])

#### Stop

In [54]:
pd.Series(us_accidents["Stop"]).unique()

array([False,  True])

#### Traffic_Calming

In [55]:
pd.Series(us_accidents["Traffic_Calming"]).unique()

array([False,  True])

#### Traffic_Signal

In [56]:
pd.Series(us_accidents["Traffic_Signal"]).unique()

array([False,  True])
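All twelve road-feature columns from Amenity to Traffic_Signal contain only `False` and `True`, so they can be checked and summarized together instead of one at a time. A minimal sketch on a stand-in frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

flag_columns = [
    "Amenity", "Bump", "Crossing", "Give_Way", "Junction", "No_Exit",
    "Railway", "Roundabout", "Station", "Stop", "Traffic_Calming",
    "Traffic_Signal",
]

# Stand-in for us_accidents: three accidents, one of them at a traffic signal.
sample = pd.DataFrame(False, index=range(3), columns=flag_columns)
sample.loc[2, "Traffic_Signal"] = True

# Every flag column should have boolean dtype.
all_boolean = all(sample[col].dtype == bool for col in flag_columns)

# True counts as 1 when summed: per-feature totals and per-accident totals.
accidents_per_feature = sample[flag_columns].sum()
features_per_accident = sample[flag_columns].sum(axis=1)

print(all_boolean)
print(accidents_per_feature["Traffic_Signal"], features_per_accident.tolist())
```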

## NBA teams dataset

In [65]:
nba_teams.head()

| | id | full_name | abbreviation | nickname | city | state | year_founded |
|---|---|---|---|---|---|---|---|
| 0 | 1610612737 | Atlanta Hawks | ATL | Hawks | Atlanta | Atlanta | 1949.0 |
| 1 | 1610612738 | Boston Celtics | BOS | Celtics | Boston | Massachusetts | 1946.0 |
| 2 | 1610612739 | Cleveland Cavaliers | CLE | Cavaliers | Cleveland | Ohio | 1970.0 |
| 3 | 1610612740 | New Orleans Pelicans | NOP | Pelicans | New Orleans | Louisiana | 2002.0 |
| 4 | 1610612741 | Chicago Bulls | CHI | Bulls | Chicago | Illinois | 1966.0 |


Dataset column names.

In [57]:
nba_teams.columns

Index(['id', 'full_name', 'abbreviation', 'nickname', 'city', 'state',
       'year_founded'],
      dtype='object')

The NBA teams dataset includes the following columns: 
* id
* full_name
* abbreviation
* nickname
* city
* state
* year_founded

Of the columns above, the ones relevant to our project are:
* abbreviation
* city
* state

### Characterization of the necessary columns

#### Abbreviation

In [58]:
pd.Series(nba_teams["abbreviation"]).unique()

array(['ATL', 'BOS', 'CLE', 'NOP', 'CHI', 'DAL', 'DEN', 'GSW', 'HOU',
       'LAC', 'LAL', 'MIA', 'MIL', 'MIN', 'BKN', 'NYK', 'ORL', 'IND',
       'PHI', 'PHX', 'POR', 'SAC', 'SAS', 'OKC', 'TOR', 'UTA', 'MEM',
       'WAS', 'DET', 'CHA'], dtype=object)

In [59]:
len(pd.Series(nba_teams["abbreviation"]).unique())

30
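The games dataset has 102 distinct abbreviations and this one has 30, and some current teams are spelled differently in the two (for example 'BRK', 'PHO' and 'CHO' in the games data versus 'BKN', 'PHX' and 'CHA' here). A set difference makes such mismatches visible before joining. The sketch uses small stand-in Series rather than the full columns.

```python
import pandas as pd

# Stand-ins for nba["team1"] and nba_teams["abbreviation"], using values
# that appear in the outputs above.
games_abbr = pd.Series(["BOS", "NYK", "BRK", "PHO", "CHO", "TRH"])
teams_abbr = pd.Series(["BOS", "NYK", "BKN", "PHX", "CHA"])

# Abbreviations present on one side only cannot be matched by a plain join.
only_in_games = sorted(set(games_abbr) - set(teams_abbr))
only_in_teams = sorted(set(teams_abbr) - set(games_abbr))

print(only_in_games)
print(only_in_teams)
```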

#### City 

In [60]:
pd.Series(nba_teams["city"]).unique()

array(['Atlanta', 'Boston', 'Cleveland', 'New Orleans', 'Chicago',
       'Dallas', 'Denver', 'Golden State', 'Houston', 'Los Angeles',
       'Miami', 'Milwaukee', 'Minnesota', 'Brooklyn', 'New York',
       'Orlando', 'Indiana', 'Philadelphia', 'Phoenix', 'Portland',
       'Sacramento', 'San Antonio', 'Oklahoma City', 'Toronto', 'Utah',
       'Memphis', 'Washington', 'Detroit', 'Charlotte'], dtype=object)

#### State 

In [61]:
pd.Series(nba_teams["state"]).unique()

array(['Atlanta', 'Massachusetts', 'Ohio', 'Louisiana', 'Illinois',
       'Texas', 'Colorado', 'California', 'Florida', 'Wisconsin',
       'Minnesota', 'New York', 'Indiana', 'Pennsylvania', 'Arizona',
       'Oregon', 'Oklahoma', 'Ontario', 'Utah', 'Tennessee',
       'District of Columbia', 'Michigan', 'North Carolina'], dtype=object)
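Two of these values are not US states: 'Atlanta' is a city (it appears as the state of the Atlanta Hawks in the first row of the dataset) and 'Ontario' is a Canadian province. 'District of Columbia' is also capitalized differently from the pollution dataset's 'District Of Columbia'. The sketch below flags values that have no match among the pollution state names, comparing case-insensitively; both Series are stand-ins for the real columns.

```python
import pandas as pd

# Stand-ins for nba_teams["state"] and pollution["State"].
team_states = pd.Series(
    ["Atlanta", "Massachusetts", "Ohio", "Ontario", "District of Columbia"]
)
pollution_states = pd.Series(
    ["Massachusetts", "Ohio", "District Of Columbia", "Georgia"]
)

# casefold() removes the "Of" versus "of" difference before comparing.
known = set(pollution_states.str.casefold())
unmatched = team_states[~team_states.str.casefold().isin(known)].tolist()

print(unmatched)
```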