## Data Munging?!
## Exploratory Data Analysis?!

# Data Munging & EDA!!

`Munging` is the colloquial term for going through raw data and processing it to a point of usefulness.<br>
`EDA` is the iterative process of learning about your data and how to process it.

You will find that a significant amount of time in most real-world data science/analysis projects involves acquiring, structuring, cleaning, and otherwise pre-processing your dataset of interest _before_ you can get into the actual analytics.

It is critical to understand the structure and reliability of your data, so the EDA/Munging process also exposes you to the strengths and weaknesses of your data set.

### What might `data munging` involve?
-Finding & cleaning `outliers` <br>
-Finding & cleaning `mis-typed` data elements<br>
-Finding & cleaning `"NaN"` or `unavailable/empty` data elements<br>

### It can also include...
-Combining data sets<br>
-Creating new data elements from existing data<br>
-Discovering new and unusual data problems to solve!

### We are going to look at a data set of NBA player data from the Kaggle website

In [1]:
# first let's import the standard libraries we've been working with

import numpy as np
import pandas as pd
import datascience as ds

Let's start by simply opening the files and reading the data into tables.

The Pandas .read_csv() feature will take a `Comma Separated Values` file and create a DataFrame from it.<br>
We find .csv files frequently. They are concise, and can be easily read & written by a variety of programs.

`.read_csv()` accepts many optional `arguments`.<br>
As with any function in a notebook, append a question mark to call up its documentation: `pd.read_csv?`<br>
For starters, we'll just provide a file name. <br>
The file must be in the current working directory; otherwise, provide its path explicitly.



In [None]:
#placeholder for pd.read_csv?

In [2]:
# Where will our read_csv look?
# os.getcwd() tells us the Current Working Directory

import os
os.getcwd()

'/Users/michaelk/data8_sports'
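A bare file name is resolved against that directory. Here is a minimal sketch of the difference, assuming a throwaway file named `demo.csv` written to a temporary folder (the file name and its two rows are made up for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny CSV so the example is self-contained (made-up file).
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "demo.csv"
csv_path.write_text("Player,Year\nCurly Armstrong,1950\nCliff Barker,1950\n")

# An explicit path works no matter what the current working directory is.
demo = pd.read_csv(csv_path)
print(demo)

# A bare file name would be looked up relative to os.getcwd().
print(Path("demo.csv").resolve())
```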

In [3]:
# the first file is Seasons_Stats.csv
# we are creating a pandas DataFrame called ann_stats (and a datascience Table called ann_table) with the contents of this file
# "nba-players-stats" is a subdirectory of "data8_sports"


ann_stats = pd.read_csv('nba-players-stats/Seasons_Stats.csv')
ann_table = ds.Table.read_table('nba-players-stats/Seasons_Stats.csv')

In [4]:
ann_table

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,blanl,OWS,DWS,WS,WS/48,blank2,OBPM,DBPM,BPM,VORP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950,Curly Armstrong,G-F,31,FTW,63,,,,0.368,,0.467,,,,,,,,,,-0.1,3.6,3.5,,,,,,,144,516,0.279,,,,144,516,0.279,0.279,170,241,0.705,,,,176,,,,217,458
1,1950,Cliff Barker,SG,29,INO,49,,,,0.435,,0.387,,,,,,,,,,1.6,0.6,2.2,,,,,,,102,274,0.372,,,,102,274,0.372,0.372,75,106,0.708,,,,109,,,,99,279
2,1950,Leo Barnhorst,SF,25,CHS,67,,,,0.394,,0.259,,,,,,,,,,0.9,2.8,3.6,,,,,,,174,499,0.349,,,,174,499,0.349,0.349,90,129,0.698,,,,140,,,,192,438
3,1950,Ed Bartels,F,24,TOT,15,,,,0.312,,0.395,,,,,,,,,,-0.5,-0.1,-0.6,,,,,,,22,86,0.256,,,,22,86,0.256,0.256,19,34,0.559,,,,20,,,,29,63
4,1950,Ed Bartels,F,24,DNN,13,,,,0.308,,0.378,,,,,,,,,,-0.5,-0.1,-0.6,,,,,,,21,82,0.256,,,,21,82,0.256,0.256,17,31,0.548,,,,20,,,,27,59
5,1950,Ed Bartels,F,24,NYK,2,,,,0.376,,0.75,,,,,,,,,,0.0,0.0,0.0,,,,,,,1,4,0.25,,,,1,4,0.25,0.25,2,3,0.667,,,,0,,,,2,4
6,1950,Ralph Beard,G,22,INO,60,,,,0.422,,0.301,,,,,,,,,,3.6,1.2,4.8,,,,,,,340,936,0.363,,,,340,936,0.363,0.363,215,282,0.762,,,,233,,,,132,895
7,1950,Gene Berce,G-F,23,TRI,3,,,,0.275,,0.313,,,,,,,,,,-0.1,0.0,-0.1,,,,,,,5,16,0.313,,,,5,16,0.313,0.313,0,5,0.0,,,,2,,,,6,10
8,1950,Charlie Black,F-C,28,TOT,65,,,,0.346,,0.395,,,,,,,,,,-2.2,5.0,2.8,,,,,,,226,813,0.278,,,,226,813,0.278,0.278,209,321,0.651,,,,163,,,,273,661
9,1950,Charlie Black,F-C,28,FTW,36,,,,0.362,,0.48,,,,,,,,,,-0.7,2.2,1.5,,,,,,,125,435,0.287,,,,125,435,0.287,0.287,132,209,0.632,,,,75,,,,140,382


In [5]:
# the method .head() returns the first 5 rows of the DataFrame
# .head() also takes an integer and returns that many rows, if desired.
# it is a good way to get a quick look at the data, as it includes the header line (column titles)

ann_stats.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


### What can we say about this file at first glance?

#### What does each row represent?
#### Is the data complete? sparse?

#### What about the "..." between PER and FT%?
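The "..." is pandas hiding the middle columns of a wide DataFrame when it prints. A small sketch of how the display limit can be changed, using a made-up 30-column frame rather than the NBA file:

```python
import pandas as pd

# Made-up frame with 30 columns and a single row.
wide = pd.DataFrame([list(range(30))], columns=[f"c{i}" for i in range(30)])

# With a cap of 10 columns, the middle columns collapse into "...".
with pd.option_context("display.max_columns", 10):
    short_repr = repr(wide)

# With no cap, every column is printed (wrapped over several lines if needed).
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide)

print(short_repr)
```

`pd.set_option("display.max_columns", None)` makes the same change for the rest of the session instead of just one block.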


In [6]:
# We have 2 more data files, "player_data.csv" and "players.csv"
# Let's investigate their contents

player_data = pd.read_csv('nba-players-stats/player_data.csv')

In [7]:
player_data.head()

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles"
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University


### player_data.csv appears to contain information about individual players

### Data fields include start/end year, position, height/weight, birthdate, and college attended

### What would it say for a player who didn't attend college?


In [8]:
# Space here to look for one of those records...
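One possible way to look for such records, sketched on a made-up three-row frame that mimics `player_data` (the names and colleges are copied from outputs in this notebook; the frame name `sample` is hypothetical):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "name": ["Alaa Abdelnaby", "Alex Abrines", "Zaid Abdul-Aziz"],
    "college": ["Duke University", np.nan, "Iowa State University"],
})

# isna() marks missing entries; the boolean mask selects those rows.
no_college = sample.loc[sample["college"].isna()]
print(no_college)
```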

In [9]:
# Let's now look at the final file, players.csv

players = pd.read_csv('nba-players-stats/Players.csv')

In [10]:
players.head()

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


### players.csv seems to have some _similar_ information as player_data...
### But they are distinct

#### We can infer that the height field in player_data.csv "6-9", "6-10" are in English units of feet-inches
#### While the same field in players.csv appears to be in Metric units of centimeters
#### We will look at some individual records to test this hypothesis
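A quick way to test the hypothesis is to convert a feet-inches string to centimeters and compare. This is only a sketch: the helper name `feet_inches_to_cm` is made up, and it assumes every height looks like `"6-10"`.

```python
def feet_inches_to_cm(height: str) -> float:
    """Convert a 'feet-inches' string such as '7-1' to centimeters."""
    feet, inches = height.split("-")
    total_inches = int(feet) * 12 + int(inches)
    return round(total_inches * 2.54, 1)


# Ralph Beard is '5-10' in player_data.csv and 178.0 in Players.csv.
print(feet_inches_to_cm("5-10"))
# Wilt Chamberlain is '7-1' in player_data.csv and 216.0 in Players.csv.
print(feet_inches_to_cm("7-1"))
```

Both results land within a centimeter of the values in Players.csv, which supports the idea that the two files record the same heights in different units.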

#### Notice also, the column title in players.csv of "collage" instead of "college"

#### Finally, notice that players.csv has an index column "Unnamed" and so we see 2 indices on the leftmost columns, one from the file, and an imputed one from the DataFrame object.
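If the extra column is unwanted, `read_csv` can be told to use the file's first column as the index. A minimal sketch on made-up CSV text (not the real file):

```python
import io

import pandas as pd

# Made-up CSV text whose first column is an unnamed, saved index.
csv_text = ",Player,height\n0,Curly Armstrong,180.0\n1,Cliff Barker,188.0\n"

# Default: the saved index becomes a regular column called 'Unnamed: 0'.
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the DataFrame index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())
print(indexed.columns.tolist())
```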


### Let's Look at some individual player records!

#### We'll start with one of the all-time legends, Wilt Chamberlain


In [11]:
# Let's look for Wilt the Stilt's record in player_data...

#player_data.loc['Wilt Chamberlain']

In [12]:
# We can see that we can give .loc[] an index and it returns the row associated with that index

player_data.loc[9]

name            Alex Abrines
year_start              2017
year_end                2018
position                 G-F
height                   6-6
weight                   190
birth_date    August 1, 1993
college                  NaN
Name: 9, dtype: object
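The commented-out lookup by name in cell 11 fails because `.loc[]` searches the index labels, and here those are integers. One option, sketched on a made-up two-row frame, is to make the name column the index first:

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Kareem Abdul-Jabbar", "Wilt Chamberlain"],
    "height": ["7-2", "7-1"],
})

# With the default integer index, no row carries the label 'Wilt Chamberlain'.
try:
    sample.loc["Wilt Chamberlain"]
except KeyError:
    print("KeyError: no row is labeled 'Wilt Chamberlain'")

# After set_index, names are the labels and .loc can find them.
by_name = sample.set_index("name")
print(by_name.loc["Wilt Chamberlain"])
```

This works cleanly only while names are unique, which will matter later in this notebook.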

In [24]:
# We can also give .loc[] a range
# This output should raise some questions...

player_data.loc[250:265]

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
250,Kent Bazemore,2013,2018,G-F,6-5,201.0,"July 1, 1989",Old Dominion University
251,Ed Beach,1951,1951,F,6-3,200.0,"January 25, 1929",West Virginia University
252,Bradley Beal,2013,2018,G,6-5,207.0,"June 28, 1993",University of Florida
253,Al Beard,1968,1968,C,6-9,200.0,"April 27, 1942",Norfolk State University
254,Butch Beard,1970,1979,G,6-3,185.0,"May 4, 1947",University of Louisville
255,Ralph Beard,1950,1951,G,5-10,175.0,"December 2, 1927",University of Kentucky
256,Charles Beasley,1968,1971,G-F,6-5,190.0,"September 23, 1945",Southern Methodist University
257,Jerome Beasley,2004,2004,F,6-10,237.0,"May 17, 1980",University of North Dakota
258,John Beasley,1968,1974,F-C,6-9,225.0,"February 5, 1944",Texas A&M University
259,Malik Beasley,2017,2018,G,6-5,196.0,"November 26, 1996",Florida State University


### Meanwhile, we still haven't located The Big Dipper's record!

#### Let's explore ways to find it.

#### First, we can see the "name" field is First Last.

#### We use the feature .isin() 


In [25]:
# the column object from a DataFrame is a Series

type(player_data['name'])

pandas.core.series.Series

In [26]:
# When we search that Series for exact matches for "Wilt Chamberlain"
# we get a Series of boolean values back, with "False" in every element except the one with Wilt

#player_data['name'].isin(['Wilt Chamberlain'])
player_data['name'].isin(['Wilt Chamberlain'])[670:680]

670    False
671    False
672    False
673     True
674    False
675    False
676    False
677    False
678    False
679    False
Name: name, dtype: bool

In [27]:
# Putting this all together we see that .loc is now taking a Series of boolean values 
# and returning the rows with values of "True"

player_data.loc[player_data['name'].isin(['Wilt Chamberlain'])]

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
673,Wilt Chamberlain,1960,1973,C,7-1,275.0,"August 21, 1936",University of Kansas


## All well and good. 
## What if we didn't know he was listed as "Wilt Chamberlain"?

### How can we search that list of players with less than a full name to go off?



In [28]:
# Here is a snippet of code that will return a list of all elements containing our string value of interest

matching = [s for s in player_data['name'] if "Wilt" in s]

In [29]:
# and we find only one other player with "Wilt" in their name

matching

['Wilt Chamberlain', 'Kyle Wiltjer']

In [30]:
# Is this case sensitive?
# Yes: the lowercase search below returns no matches

matching2 = [s for s in player_data['name'] if "wilt" in s]
print(matching2)

[]
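If a case-insensitive search is wanted, the pandas string method `.str.contains()` has a `case` argument. A short sketch on a made-up Series of three names:

```python
import pandas as pd

names = pd.Series(["Wilt Chamberlain", "Kyle Wiltjer", "Dee Brown"])

# Default behavior is case sensitive, so lowercase 'wilt' finds nothing.
exact_case = names[names.str.contains("wilt")]

# case=False ignores upper/lower case.
any_case = names[names.str.contains("wilt", case=False)]

print(exact_case.tolist())
print(any_case.tolist())
```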


### We can put these pieces together and return all the records containing our search string of interest

In [31]:
player_data.loc[player_data['name'].isin(s for s in player_data['name'] if "Wilt" in s)]

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
673,Wilt Chamberlain,1960,1973,C,7-1,275.0,"August 21, 1936",University of Kansas
4446,Kyle Wiltjer,2017,2017,F,6-10,240.0,"October 20, 1992",Gonzaga University


In [32]:
# the player_data['name'] column seems well formed
# but we run into an issue with the similar column in players.csv...players['Player']

#players.loc[players['Player'].isin(s for s in players['Player'] if "Wilt" in s)]

In [33]:
players_list = players['Player']

In [34]:
type(players_list)

pandas.core.series.Series

In [35]:
# note the error
# you can confirm independently that both Wilt C and Kyle Wiltjer have entries in the file

matching3 = [s for s in players_list if "Wilt" in s]

TypeError: argument of type 'float' is not iterable

### The issue is caused by a "NaN" (missing) entry, which pandas stores as a float

### We need to clean that value up
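Before dropping anything, the search itself can be made tolerant of missing values by skipping entries that are not strings. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

players_col = pd.Series(["Wilt Chamberlain*", np.nan, "Kyle Wiltjer"])

# np.nan is a float, so `"Wilt" in np.nan` raises TypeError.
# Checking isinstance first skips the missing entry.
matching = [s for s in players_col if isinstance(s, str) and "Wilt" in s]
print(matching)
```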

### We can use the feature .isna() to identify that troublesome entry
### And, lo and behold, the entire row is blank!

### How do we drop a row?

In [36]:
players.loc[players_list.isna()]

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
223,223,,,,,,,


### Let's look at the surrounding rows.

In [37]:
players.loc[220:225]

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
220,220,D.C. Wilcutt,175.0,70.0,,1926.0,,
221,221,Bob Wood,175.0,70.0,Northern Illinois University,1921.0,,
222,222,Max Zaslofsky,188.0,77.0,St. John's University,1925.0,Brooklyn,New York
223,223,,,,,,,
224,224,Paul Arizin*,193.0,86.0,Villanova University,1928.0,Philadelphia,Pennsylvania
225,225,Ed Beach,190.0,90.0,West Virginia University,1929.0,,


### Let's make sure we have the correct index of the row we want to drop

### It looks like it should be record 223

In [38]:
players.loc[223]

Unnamed: 0     223
Player         NaN
height         NaN
weight         NaN
collage        NaN
born           NaN
birth_city     NaN
birth_state    NaN
Name: 223, dtype: object

In [39]:
# We can investigate the signature of drop using `?`

#players.drop?

In [40]:
# We can always re-import the file.
# Let's see what happens if we simply give .drop() an index...

players2 = players.drop([223])

In [41]:
# We can look at that set of records now

players2.loc[220:225]

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
220,220,D.C. Wilcutt,175.0,70.0,,1926.0,,
221,221,Bob Wood,175.0,70.0,Northern Illinois University,1921.0,,
222,222,Max Zaslofsky,188.0,77.0,St. John's University,1925.0,Brooklyn,New York
224,224,Paul Arizin*,193.0,86.0,Villanova University,1928.0,Philadelphia,Pennsylvania
225,225,Ed Beach,190.0,90.0,West Virginia University,1929.0,,
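An alternative to dropping by index is `.dropna()`, which finds the row for us. The sketch below uses a made-up three-row frame shaped like `players`. Note that `how="all"` would not remove the blank row here, because its `Unnamed: 0` value is filled in.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Unnamed: 0": [222, 223, 224],
    "Player": ["Max Zaslofsky", np.nan, "Paul Arizin*"],
    "height": [188.0, np.nan, 193.0],
})

# how="all" keeps row 223: its 'Unnamed: 0' value is not missing.
kept_all = sample.dropna(how="all")

# Restricting the check to the 'Player' column removes it.
cleaned = sample.dropna(subset=["Player"])
print(cleaned)
```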


In [42]:
# and let's go back to our original goal of finding all the entries with "Wilt" using that same script

players2.loc[players2['Player'].isin(s for s in players2['Player'] if "Wilt" in s)]


Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
494,494,Wilt Chamberlain*,216.0,124.0,University of Kansas,1936.0,Philadelphia,Pennsylvania
3918,3918,Kyle Wiltjer,208.0,108.0,Gonzaga University,1992.0,Portland,Oregon


## SUCCESS!!

### But what do we notice about the GOAT?

### In this file it seems there's a star attached to his name
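Whatever the star means (the files themselves do not say), it will get in the way when matching names across files. One possible cleanup, sketched on a made-up Series, keeps the information as a separate flag and strips the character from the name:

```python
import pandas as pd

names = pd.Series(["Wilt Chamberlain*", "Kyle Wiltjer", "Julius Erving*"])

# Remember which names carried a star before removing it.
starred = names.str.endswith("*")
clean_names = names.str.rstrip("*")

print(clean_names.tolist())
print(starred.tolist())
```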

# STOPPING POINT Feb 14


### Let's use the same (s for s in players2['Player'] if "Wilt" in s) construct to search for all the 
### "Wilt"s in the annual statistics file


In [138]:
# We observe the same star after his name...but not Kyle Wiltjer

ann_stats.loc[ann_stats['Player'].isin(s for s in players2['Player'] if "Wilt" in s)]

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
1473,1473,1960.0,Wilt Chamberlain*,C,23.0,PHW,72.0,,3338.0,28.0,...,0.582,,,1941.0,168.0,,,,150.0,2707.0
1593,1593,1961.0,Wilt Chamberlain*,C,24.0,PHW,79.0,,3773.0,27.8,...,0.504,,,2149.0,148.0,,,,130.0,3033.0
1706,1706,1962.0,Wilt Chamberlain*,C,25.0,PHW,80.0,,3882.0,31.7,...,0.613,,,2052.0,192.0,,,,123.0,4029.0
1827,1827,1963.0,Wilt Chamberlain*,C,26.0,SFW,80.0,,3806.0,31.8,...,0.593,,,1946.0,275.0,,,,136.0,3586.0
1962,1962,1964.0,Wilt Chamberlain*,C,27.0,SFW,80.0,,3689.0,31.6,...,0.531,,,1787.0,403.0,,,,182.0,2948.0
2099,2099,1965.0,Wilt Chamberlain*,C,28.0,TOT,73.0,,3301.0,28.6,...,0.464,,,1673.0,250.0,,,,146.0,2534.0
2100,2100,1965.0,Wilt Chamberlain*,C,28.0,SFW,38.0,,1743.0,29.8,...,0.416,,,893.0,117.0,,,,76.0,1480.0
2101,2101,1965.0,Wilt Chamberlain*,C,28.0,PHI,35.0,,1558.0,27.3,...,0.526,,,780.0,133.0,,,,70.0,1054.0
2239,2239,1966.0,Wilt Chamberlain*,C,29.0,PHI,79.0,,3737.0,28.3,...,0.513,,,1943.0,414.0,,,,171.0,2649.0
2366,2366,1967.0,Wilt Chamberlain*,C,30.0,PHI,81.0,,3682.0,26.5,...,0.441,,,1957.0,630.0,,,,143.0,1956.0


### It turns out that in Seasons_Stats.csv and Players.csv, but not player_data.csv, some players get a star after their name...why?


In [139]:
players.loc[players['Player'].isin(['Wilt Chamberlain*'])]

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
494,494,Wilt Chamberlain*,216.0,124.0,University of Kansas,1936.0,Philadelphia,Pennsylvania


In [140]:
# For example, the illustrious Dr J.


print(ann_stats.loc[ann_stats['Player'].isin(['Julius Erving*'])])

      Unnamed: 0    Year          Player Pos   Age   Tm     G    GS      MP  \
4745        4745  1977.0  Julius Erving*  SF  26.0  PHI  82.0   NaN  2940.0   
5122        5122  1978.0  Julius Erving*  SF  27.0  PHI  74.0   NaN  2429.0   
5479        5479  1979.0  Julius Erving*  SF  28.0  PHI  78.0   NaN  2802.0   
5831        5831  1980.0  Julius Erving*  SF  29.0  PHI  78.0   NaN  2812.0   
6179        6179  1981.0  Julius Erving*  SF  30.0  PHI  82.0   NaN  2874.0   
6546        6546  1982.0  Julius Erving*  SF  31.0  PHI  81.0  81.0  2789.0   
6923        6923  1983.0  Julius Erving*  SF  32.0  PHI  72.0  72.0  2421.0   
7300        7300  1984.0  Julius Erving*  SF  33.0  PHI  77.0  77.0  2683.0   
7650        7650  1985.0  Julius Erving*  SF  34.0  PHI  78.0  78.0  2535.0   
8018        8018  1986.0  Julius Erving*  SF  35.0  PHI  74.0  74.0  2474.0   
8402        8402  1987.0  Julius Erving*  SF  36.0  PHI  60.0  60.0  1918.0   

       PER  ...    FT%    ORB    DRB    TRB    AST 

In [141]:
# The inimitable Kareem Abdul-Jabbar (what about Lew Alcindor?)

print(players.loc[players['Player'].isin(['Kareem Abdul-Jabbar*'])])

     Unnamed: 0                Player  height  weight  \
789         789  Kareem Abdul-Jabbar*   218.0   102.0   

                                   collage    born birth_city birth_state  
789  University of California, Los Angeles  1947.0   New York    New York  


In [142]:
# And in the other file, no star!

print(player_data.loc[player_data['name'].isin(['Kareem Abdul-Jabbar'])])

                  name  year_start  year_end position height  weight  \
2  Kareem Abdul-Jabbar        1970      1989        C    7-2   225.0   

       birth_date                                college  
2  April 16, 1947  University of California, Los Angeles  


In [143]:
# Note that we appear to be missing "Games Started" data before 1982
# Steals and Blocks before 1974
# and Turnovers before 1978

print(ann_stats.loc[ann_stats['Player'].isin(['Kareem Abdul-Jabbar*'])])

      Unnamed: 0    Year                Player Pos   Age   Tm     G    GS  \
2868        2868  1970.0  Kareem Abdul-Jabbar*   C  22.0  MIL  82.0   NaN   
3070        3070  1971.0  Kareem Abdul-Jabbar*   C  23.0  MIL  82.0   NaN   
3316        3316  1972.0  Kareem Abdul-Jabbar*   C  24.0  MIL  81.0   NaN   
3582        3582  1973.0  Kareem Abdul-Jabbar*   C  25.0  MIL  76.0   NaN   
3852        3852  1974.0  Kareem Abdul-Jabbar*   C  26.0  MIL  81.0   NaN   
4098        4098  1975.0  Kareem Abdul-Jabbar*   C  27.0  MIL  65.0   NaN   
4375        4375  1976.0  Kareem Abdul-Jabbar*   C  28.0  LAL  82.0   NaN   
4650        4650  1977.0  Kareem Abdul-Jabbar*   C  29.0  LAL  82.0   NaN   
5010        5010  1978.0  Kareem Abdul-Jabbar*   C  30.0  LAL  62.0   NaN   
5382        5382  1979.0  Kareem Abdul-Jabbar*   C  31.0  LAL  80.0   NaN   
5727        5727  1980.0  Kareem Abdul-Jabbar*   C  32.0  LAL  82.0   NaN   
6085        6085  1981.0  Kareem Abdul-Jabbar*   C  33.0  LAL  80.0   NaN   

## Let's look at a common issue...duplicated identifiers

## "Name" is a standard identifier in sports data
## And prone to multiple entries

In [144]:
print(player_data.loc[player_data['name'].isin(['Dee Brown'])])

          name  year_start  year_end position height  weight  \
487  Dee Brown        1991      2002        G    6-1   160.0   
488  Dee Brown        2007      2009        G    6-0   185.0   

            birth_date                                     college  
487  November 29, 1968                     Jacksonville University  
488    August 17, 1984  University of Illinois at Urbana-Champaign  


In [145]:
print(players.loc[players['Player'].isin(['Dee Brown'])])

      Unnamed: 0     Player  height  weight                  collage    born  \
2054        2054  Dee Brown   185.0    72.0  Jacksonville University  1968.0   

        birth_city birth_state  
2054  Jacksonville     Florida  


In [146]:
players2.loc[players2['Player'].isin(s for s in players2['Player'] if "Dee" in s)]

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
60,60,Dee Gibson,180.0,79.0,Western Kentucky University,1923.0,,
470,470,Archie Dees,203.0,92.0,Indiana University,1936.0,Ethel,Mississippi
2054,2054,Dee Brown,185.0,72.0,Jacksonville University,1968.0,Jacksonville,Florida


In [58]:
# players2.loc[players2['born'].isin(s for s in players2['born'] if players2['born'].equals(1984.0) in s)]

## Here we observe 2 "Dee Brown"s in player_data.csv
## But only 1 in Players.csv! <br> <br>
## This certainly becomes an issue when we want to look at individual summary stats
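Rather than stumbling on repeated names one at a time, we can ask pandas for all of them. A sketch on a made-up three-row frame (birth dates copied from the outputs in this notebook):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Dee Brown", "Dee Brown", "Wilt Chamberlain"],
    "birth_date": ["November 29, 1968", "August 17, 1984", "August 21, 1936"],
})

# keep=False flags every occurrence of a repeated name, not just the later ones.
dupes = sample[sample["name"].duplicated(keep=False)]
print(dupes)
```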

In [147]:
print(ann_stats.loc[ann_stats['Player'].isin(['Dee Brown'])])

       Unnamed: 0    Year     Player Pos   Age   Tm     G    GS      MP   PER  \
10053       10053  1991.0  Dee Brown  PG  22.0  BOS  82.0   5.0  1945.0  13.2   
10506       10506  1992.0  Dee Brown  PG  23.0  BOS  31.0  20.0   883.0  13.0   
10965       10965  1993.0  Dee Brown  PG  24.0  BOS  80.0  48.0  2254.0  17.2   
11415       11415  1994.0  Dee Brown  SG  25.0  BOS  77.0  76.0  2867.0  16.0   
11890       11890  1995.0  Dee Brown  SG  26.0  BOS  79.0  69.0  2792.0  15.7   
12346       12346  1996.0  Dee Brown  PG  27.0  BOS  65.0  23.0  1591.0  13.7   
12895       12895  1997.0  Dee Brown  PG  28.0  BOS  21.0   2.0   522.0  11.3   
13473       13473  1998.0  Dee Brown  SG  29.0  TOT  72.0  12.0  1719.0  14.2   
13474       13474  1998.0  Dee Brown  SG  29.0  BOS  41.0  10.0   811.0  11.4   
13475       13475  1998.0  Dee Brown  SG  29.0  TOR  31.0   2.0   908.0  16.8   
14016       14016  1999.0  Dee Brown  PG  30.0  TOR  49.0   0.0  1377.0  13.9   
14532       14532  2000.0  D

## You can imagine this becoming thornier...<br>

## What happens when 2 players with the same name are playing at the same time?!<br>

## Let's examine the mysterious Eddie Johnson(s)

In [148]:
print(ann_stats.loc[ann_stats['Player'].isin(['Eddie Johnson'])])

       Unnamed: 0    Year         Player Pos   Age   Tm     G    GS      MP  \
5177         5177  1978.0  Eddie Johnson  SG  22.0  ATL  79.0   NaN  1875.0   
5540         5540  1979.0  Eddie Johnson  SG  23.0  ATL  78.0   NaN  2413.0   
5885         5885  1980.0  Eddie Johnson  SG  24.0  ATL  79.0   NaN  2622.0   
6240         6240  1981.0  Eddie Johnson  SG  25.0  ATL  75.0   NaN  2693.0   
6605         6605  1982.0  Eddie Johnson  SG  26.0  ATL  68.0  57.0  2314.0   
6606         6606  1982.0  Eddie Johnson  SF  22.0  KCK  74.0  27.0  1517.0   
6978         6978  1983.0  Eddie Johnson  SG  27.0  ATL  61.0  57.0  1813.0   
6979         6979  1983.0  Eddie Johnson  SF  23.0  KCK  82.0  82.0  2933.0   
7344         7344  1984.0  Eddie Johnson  SG  28.0  ATL  67.0  43.0  1893.0   
7345         7345  1984.0  Eddie Johnson  SF  24.0  KCK  82.0  82.0  2920.0   
7694         7694  1985.0  Eddie Johnson  SG  29.0  ATL  73.0  66.0  2367.0   
7695         7695  1985.0  Eddie Johnson  SF  25.0  

In [62]:
# Again, we don't even see the duplicate in "Players.csv"

print(players.loc[players['Player'].isin(['Eddie Johnson'])])

      Unnamed: 0         Player  height  weight                     collage  \
1258        1258  Eddie Johnson   203.0    92.0  Tennessee State University   

        born birth_city birth_state  
1258  1944.0    Atlanta     Georgia  


In [63]:
# But we do in "player_data.csv"

print(player_data.loc[player_data['name'].isin(['Eddie Johnson'])])

               name  year_start  year_end position height  weight  \
2010  Eddie Johnson        1978      1987        G    6-2   180.0   
2011  Eddie Johnson        1982      1999      F-G    6-7   215.0   

             birth_date                                     college  
2010  February 24, 1955                           Auburn University  
2011        May 1, 1959  University of Illinois at Urbana-Champaign  


Eddie Johnson (1) from Auburn University
Played 10 NBA seasons, primarily with Hawks
https://en.wikipedia.org/wiki/Eddie_Johnson_(basketball%2C_born_1955)

Eddie Johnson (2) from UI-Chambana
Played 17 NBA seasons, with Kings, Suns, Sonics, Hornets, Pacers, Spurs
https://en.wikipedia.org/wiki/Eddie_Johnson_(basketball%2C_born_1959)

# Common names often generate these anomalies
# Three George Johnson(s)
# ...overlapping!

In [149]:
print(player_data.loc[player_data['name'].isin(['George Johnson'])])

                name  year_start  year_end position height  weight  \
2015  George Johnson        1971      1974        C   6-11   245.0   
2016  George Johnson        1973      1986      C-F   6-11   205.0   
2017  George Johnson        1979      1986      F-C    6-7   210.0   

             birth_date                             college  
2015      June 19, 1947  Stephen F. Austin State University  
2016  December 18, 1948                  Dillard University  
2017   December 8, 1956               St. John's University  


We have 2 records each in our annual stats file for a "George Johnson" in:  <br>
1973 <br>
1974 <br>
1979 <br>
1980 <br>
1981 <br>
1982 <br>
1983 <br>
1985 <br>
1986 <br>


`Record 2015` <br>
George Johnson (1: George E. Johnson) out of Stephen F. Austin State University (TX)
1st Round draft pick (9th overall)
Marginal NBA player  <br>
1971 BAL <br>
1973 HOU <br>
1974 HOU <br>

`Record 2016` <br>
George Johnson (2: George T. Johnson) out of Dillard University (LA)
5th Round draft pick (79th overall)
Solid NBA player (20 MPG ave)
Won Championship with 74-75 Warriors
1973-77 GSW <br>
1977 BUF --> Now the Clippers! <br>
1978-80 NJN (New Jersey Nets, now in Brooklyn) <br>
1981-82 SAS  <br>
1983 ATL (1982-83 season) <br>
1984   not in league 83-84 season <br>
1985 NJN (1984-85 season) <br>
1986 SEA (1985-86 season) <br>

`Record 2017` <br> 
George Johnson (3: George L. Johnson) out of St. John's University (NY)
1st Round draft pick (12th overall)
Solid NBA player (21 MPG ave)
1978-79 MIL <br>
1979-80 DEN <br>
1980-84 IND <br>
1985 PHI <br>
1986 WSB <br>


### Sometimes fixing these anomalies requires going into the file and modifying entries by hand

### Going from a "pretty good" data set to one that is ready & fit for analysis is an iterative process that takes tenacity and awareness.

### A little detective work and properly devised checks can help here
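As one example of such a check, a name that appears in the same season with two different ages is probably two different people. This is a sketch on made-up rows copied from the outputs in this notebook, not a complete test: two same-named players of the same age would slip through.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1982, 1982, 1977, 1977, 1977],
    "Player": ["Eddie Johnson", "Eddie Johnson",
               "George Johnson", "George Johnson", "George Johnson"],
    "Age": [26, 22, 28, 28, 28],
    "Tm": ["ATL", "KCK", "TOT", "GSW", "BUF"],
})

# Count distinct ages per (name, season); more than one is suspicious.
ages_per_name_year = sample.groupby(["Player", "Year"])["Age"].nunique()
suspects = ages_per_name_year[ages_per_name_year > 1]
print(suspects)
```

The 1977 George Johnson rows are not flagged: they are one player's 'TOT' line plus his two team lines, all with the same age.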





In [150]:
print(ann_stats.loc[ann_stats['Player'].isin(['George Johnson'])])

      Unnamed: 0    Year          Player Pos   Age   Tm     G    GS      MP  \
3185        3185  1971.0  George Johnson   C  23.0  BAL  24.0   NaN   337.0   
3694        3694  1973.0  George Johnson   C  25.0  HOU  19.0   NaN   169.0   
3695        3695  1973.0  George Johnson   C  24.0  GSW  56.0   NaN   349.0   
3953        3953  1974.0  George Johnson   C  26.0  HOU  26.0   NaN   238.0   
3954        3954  1974.0  George Johnson   C  25.0  GSW  66.0   NaN  1291.0   
4222        4222  1975.0  George Johnson   C  26.0  GSW  82.0   NaN  1439.0   
4494        4494  1976.0  George Johnson   C  27.0  GSW  82.0   NaN  1745.0   
4808        4808  1977.0  George Johnson   C  28.0  TOT  78.0   NaN  1652.0   
4809        4809  1977.0  George Johnson   C  28.0  GSW  39.0   NaN   597.0   
4810        4810  1977.0  George Johnson   C  28.0  BUF  39.0   NaN  1055.0   
5178        5178  1978.0  George Johnson   C  29.0  NJN  81.0   NaN  2411.0   
5541        5541  1979.0  George Johnson   C  30.0  

### Here's how we look at all the players' stats from a given year

### Notice multiple entries for players who switch teams. One entry each for the teams they played for and a summary 'TOT' line.
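Summing a season naively would count traded players twice. One possible approach is to keep the 'TOT' line where it exists and the single team line otherwise. This sketch uses four rows copied from the 2017 output below, and it assumes names are unique within a season, which we have just seen is not always true.

```python
import pandas as pd

sample = pd.DataFrame({
    "Player": ["Quincy Acy", "Quincy Acy", "Quincy Acy", "Steven Adams"],
    "Year": [2017, 2017, 2017, 2017],
    "Tm": ["TOT", "DAL", "BRK", "OKC"],
    "G": [38, 6, 32, 80],
})

is_tot = sample["Tm"] == "TOT"
# True for every row of a player-season that has a 'TOT' line.
has_tot = is_tot.groupby([sample["Player"], sample["Year"]]).transform("any")

# Keep the 'TOT' line where one exists, otherwise the single team line.
season_totals = sample[is_tot | ~has_tot]
print(season_totals)
```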

In [151]:
print(ann_stats.loc[ann_stats['Year'] == 1977.0])

      Unnamed: 0    Year                Player    Pos   Age   Tm     G  GS  \
4649        4649  1977.0       Zaid Abdul-Aziz      C  30.0  BUF  22.0 NaN   
4650        4650  1977.0  Kareem Abdul-Jabbar*      C  29.0  LAL  82.0 NaN   
4651        4651  1977.0         Tom Abernethy     SF  22.0  LAL  70.0 NaN   
4652        4652  1977.0           Alvan Adams      C  22.0  PHO  72.0 NaN   
4653        4653  1977.0             Don Adams     SF  29.0  BUF  77.0 NaN   
4654        4654  1977.0          Lucius Allen     PG  29.0  LAL  78.0 NaN   
4655        4655  1977.0       Jerome Anderson     SG  23.0  IND  27.0 NaN   
4656        4656  1977.0       Tiny Archibald*     PG  28.0  NYN  34.0 NaN   
4657        4657  1977.0               Jim Ard      C  28.0  BOS  63.0 NaN   
4658        4658  1977.0          Bird Averitt     SG  24.0  BUF  75.0 NaN   
4659        4659  1977.0         Dennis Awtrey      C  28.0  PHO  72.0 NaN   
4660        4660  1977.0           Mike Bantom  PF-SF  25.0  TOT

In [152]:
ann_stats.loc[(ann_stats['Year'] == 1977.0)].get('Tm').unique()

array(['BUF', 'LAL', 'PHO', 'IND', 'NYN', 'BOS', 'TOT', 'SEA', 'ATL',
       'DET', 'PHI', 'KCK', 'GSW', 'NYK', 'DEN', 'NOJ', 'WSB', 'CHI',
       'CLE', 'MIL', 'SAS', 'POR', 'HOU'], dtype=object)

In [153]:
357/22

16.227272727272727

In [154]:
print(ann_stats.loc[ann_stats['Year'] == 2017.0])

       Unnamed: 0    Year                 Player Pos   Age   Tm     G    GS  \
24096       24096  2017.0           Alex Abrines  SG  23.0  OKC  68.0   6.0   
24097       24097  2017.0             Quincy Acy  PF  26.0  TOT  38.0   1.0   
24098       24098  2017.0             Quincy Acy  PF  26.0  DAL   6.0   0.0   
24099       24099  2017.0             Quincy Acy  PF  26.0  BRK  32.0   1.0   
24100       24100  2017.0           Steven Adams   C  23.0  OKC  80.0  80.0   
24101       24101  2017.0          Arron Afflalo  SG  31.0  SAC  61.0  45.0   
24102       24102  2017.0          Alexis Ajinca   C  28.0  NOP  39.0  15.0   
24103       24103  2017.0           Cole Aldrich   C  28.0  MIN  62.0   0.0   
24104       24104  2017.0      LaMarcus Aldridge  PF  31.0  SAS  72.0  72.0   
24105       24105  2017.0            Lavoy Allen  PF  27.0  IND  61.0   5.0   
24106       24106  2017.0             Tony Allen  SG  35.0  MEM  71.0  66.0   
24107       24107  2017.0        Al-Farouq Aminu  SF

In [82]:
ann_stats.axes

[RangeIndex(start=0, stop=24691, step=1),
 Index(['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP',
        'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%',
        'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2',
        'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
        '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
        'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
       dtype='object')]

In [155]:
595/30

19.833333333333332

In [156]:
print(ann_stats.loc[ann_stats['Player'].str.contains("Johnson")])

ValueError: cannot index with vector containing NA / NaN values
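The error arises because `.str.contains()` returns `NaN` for missing names, and `.loc[]` refuses a mask with holes in it. Passing `na=False` treats missing names as non-matches. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {"Player": ["George Johnson", np.nan, "Dee Brown", "Eddie Johnson"]}
)

# na=False fills the mask with False wherever the name is missing.
mask = sample["Player"].str.contains("Johnson", na=False)
johnsons = sample.loc[mask]
print(johnsons)
```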

In [157]:
guard_series = ann_stats['Pos'].str.contains("G")

In [158]:
guard_series

0         True
1         True
2        False
3        False
4        False
5        False
6         True
7         True
8        False
9        False
10       False
11        True
12       False
13       False
14        True
15       False
16        True
17       False
18        True
19        True
20        True
21       False
22       False
23       False
24        True
25        True
26        True
27        True
28       False
29       False
         ...  
24661     True
24662     True
24663     True
24664    False
24665    False
24666    False
24667     True
24668     True
24669     True
24670    False
24671    False
24672    False
24673    False
24674    False
24675    False
24676    False
24677    False
24678    False
24679    False
24680    False
24681     True
24682     True
24683     True
24684     True
24685    False
24686    False
24687    False
24688    False
24689    False
24690    False
Name: Pos, Length: 24691, dtype: object

In [159]:
guard_series.hasnans

True

In [161]:
guard_series.fillna(False)

0         True
1         True
2        False
3        False
4        False
5        False
6         True
7         True
8        False
9        False
10       False
11        True
12       False
13       False
14        True
15       False
16        True
17       False
18        True
19        True
20        True
21       False
22       False
23       False
24        True
25        True
26        True
27        True
28       False
29       False
         ...  
24661     True
24662     True
24663     True
24664    False
24665    False
24666    False
24667     True
24668     True
24669     True
24670    False
24671    False
24672    False
24673    False
24674    False
24675    False
24676    False
24677    False
24678    False
24679    False
24680    False
24681     True
24682     True
24683     True
24684     True
24685    False
24686    False
24687    False
24688    False
24689    False
24690    False
Name: Pos, Length: 24691, dtype: bool

In [162]:
guard_series.hasnans

True
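
`hasnans` is still `True` here because `.fillna()` returns a new Series and leaves the original untouched unless the result is assigned to a name. A minimal sketch, where `mask` is a made-up stand-in for `guard_series`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for guard_series: a mask with one missing entry
mask = pd.Series([True, np.nan, False], dtype=object)

mask.fillna(False)       # returns a new Series; the result is discarded here
print(mask.hasnans)      # the original still holds the NaN

# Assign the result to keep it (astype(bool) makes the dtype explicit)
mask_filled = mask.fillna(False).astype(bool)
print(mask_filled.hasnans)
```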

In [163]:
player_nametest1 = ann_stats['Player'].str.contains("Johnson")

In [164]:
player_nametest1.hasnans

True

In [76]:
type(player_nametest1)

pandas.core.series.Series

In [77]:
player_nametest1.shape

(24691,)

In [78]:
print(player_nametest1.get('False'))

None
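
`.get('False')` returns `None` because `Series.get()` looks up an index label, and no row is labelled with the string `'False'`; it does not search the values. To see how many entries are `True`, `False`, or missing, counting the values is one option. In this sketch `name_test` is a made-up stand-in for `player_nametest1`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for player_nametest1
name_test = pd.Series([True, False, np.nan, False], dtype=object)

print(name_test.get("False"))   # no index label "False", so the default (None) comes back
print(name_test.get(0))         # label lookup: the value stored at index 0

# Count each distinct value, keeping missing entries in the tally
print(name_test.value_counts(dropna=False))

# Number of missing entries on its own
n_missing = int(name_test.isna().sum())
print(n_missing)
```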


In [165]:
player_start_NaNs = ann_stats['Player'].index[ann_stats['GS'].apply(np.isnan)]

In [166]:
player_start_NaNs

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            18742, 19338, 19921, 20500, 21126, 21678, 22252, 22864, 23516,
            24095],
           dtype='int64', length=6458)
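
The same kind of index can be built with `Series.isna()`, which also works on text columns where `np.isnan` would raise an error. A short sketch on a made-up frame (`demo` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ann_stats
demo = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "GS": [np.nan, 12.0, np.nan, 80.0],
})

# Index labels of the rows where games started (GS) was not recorded
gs_missing_idx = demo.index[demo["GS"].isna()]
print(gs_missing_idx)

# Share of rows with GS missing (the mean of a boolean Series)
gs_missing_share = demo["GS"].isna().mean()
print(gs_missing_share)
```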

In [80]:
type(ann_stats['Player'])

pandas.core.series.Series

In [81]:
guard_series.to_csv("boolean_list_of_guards.csv")



In [82]:
player_start_NaNs

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            18742, 19338, 19921, 20500, 21126, 21678, 22252, 22864, 23516,
            24095],
           dtype='int64', length=6458)

In [167]:
player_names

0          Curly Armstrong
1             Cliff Barker
2            Leo Barnhorst
3               Ed Bartels
4               Ed Bartels
5               Ed Bartels
6              Ralph Beard
7               Gene Berce
8            Charlie Black
9            Charlie Black
10           Charlie Black
11             Nelson Bobb
12         Jake Bornheimer
13            Vince Boryla
14               Don Boven
15           Harry Boykoff
16             Joe Bradley
17             Bob Brannum
18              Carl Braun
19           Frankie Brian
20        Price Brookfield
21               Bob Brown
22              Jim Browne
23              Walt Budko
24          Jack Burmaster
25            Tommy Byrnes
26            Bill Calhoun
27             Don Carlson
28           Bob Carpenter
29             Jake Carter
               ...        
24661       Deron Williams
24662       Deron Williams
24663       Deron Williams
24664     Derrick Williams
24665     Derrick Williams
24666     Derrick Williams
2

In [168]:
player_names = pd.Series(ann_stats['Player'])

In [171]:
print(ann_stats.loc[ann_stats['Player'].isin(['Cody Zeller'])])

       Unnamed: 0    Year       Player Pos   Age   Tm     G    GS      MP  \
22862       22862  2014.0  Cody Zeller   C  21.0  CHA  82.0   3.0  1416.0   
23514       23514  2015.0  Cody Zeller   C  22.0  CHO  62.0  45.0  1487.0   
24093       24093  2016.0  Cody Zeller   C  23.0  CHO  73.0  60.0  1774.0   
24686       24686  2017.0  Cody Zeller  PF  24.0  CHO  62.0  58.0  1725.0   

        PER  ...    FT%    ORB    DRB    TRB    AST   STL   BLK   TOV     PF  \
22862  13.1  ...  0.730  118.0  235.0  353.0   92.0  40.0  41.0  87.0  170.0   
23514  14.1  ...  0.774   97.0  265.0  362.0  100.0  34.0  49.0  62.0  156.0   
24093  16.1  ...  0.754  138.0  317.0  455.0   71.0  57.0  63.0  68.0  204.0   
24686  16.7  ...  0.679  135.0  270.0  405.0   99.0  62.0  58.0  65.0  189.0   

         PTS  
22862  490.0  
23514  472.0  
24093  638.0  
24686  639.0  

[4 rows x 53 columns]


In [169]:
type(player_names)

pandas.core.series.Series

In [170]:
type(player_names[6])

str

In [87]:
ann_stats.axes

[RangeIndex(start=0, stop=24691, step=1),
 Index(['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP',
        'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%',
        'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2',
        'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
        '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
        'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
       dtype='object')]

## Let's look at some individual season performances
## We can use .loc[] to pull out performances of interest
## Here we get player-seasons with at least 2,500 points, from 1960 onward

## Note that most of these names end in an asterisk (a "star")

In [172]:
print(ann_stats.loc[(ann_stats['Year'] >= 1960.0) & (ann_stats['PTS'] >= 2500)])

       Unnamed: 0    Year                Player Pos   Age   Tm     G    GS  \
1473         1473  1960.0     Wilt Chamberlain*   C  23.0  PHW  72.0   NaN   
1582         1582  1961.0         Elgin Baylor*  SF  26.0  LAL  73.0   NaN   
1593         1593  1961.0     Wilt Chamberlain*   C  24.0  PHW  79.0   NaN   
1706         1706  1962.0     Wilt Chamberlain*   C  25.0  PHW  80.0   NaN   
1812         1812  1963.0         Elgin Baylor*  SF  28.0  LAL  80.0   NaN   
1827         1827  1963.0     Wilt Chamberlain*   C  26.0  SFW  80.0   NaN   
1962         1962  1964.0     Wilt Chamberlain*   C  27.0  SFW  80.0   NaN   
2099         2099  1965.0     Wilt Chamberlain*   C  28.0  TOT  73.0   NaN   
2239         2239  1966.0     Wilt Chamberlain*   C  29.0  PHI  79.0   NaN   
2355         2355  1967.0           Rick Barry*  SF  22.0  SFW  78.0   NaN   
3070         3070  1971.0  Kareem Abdul-Jabbar*   C  23.0  MIL  82.0   NaN   
3316         3316  1972.0  Kareem Abdul-Jabbar*   C  24.0  MIL  

## We can also of course filter on other stats, like assists

In [174]:
print(ann_stats.loc[(ann_stats['Year'] >= 1960.0) & (ann_stats['AST'] >= 900)])

       Unnamed: 0    Year           Player Pos   Age   Tm     G    GS      MP  \
2445         2445  1967.0     Guy Rodgers*  PG  31.0  CHI  81.0   NaN  3063.0   
3588         3588  1973.0  Tiny Archibald*  PG  24.0  KCO  80.0   NaN  3681.0   
5621         5621  1979.0     Kevin Porter  PG  28.0  DET  82.0   NaN  3064.0   
7431         7431  1984.0       Norm Nixon  PG  28.0  SDC  82.0  82.0  3053.0   
7498         7498  1984.0    Isiah Thomas*  PG  22.0  DET  82.0  82.0  3007.0   
7699         7699  1985.0   Magic Johnson*  PG  25.0  LAL  77.0  77.0  2781.0   
7857         7857  1985.0    Isiah Thomas*  PG  23.0  DET  81.0  81.0  3089.0   
8082         8082  1986.0   Magic Johnson*  PG  26.0  LAL  72.0  70.0  2578.0   
8459         8459  1987.0   Magic Johnson*  PG  27.0  LAL  80.0  80.0  2904.0   
9018         9018  1988.0   John Stockton*  PG  25.0  UTA  82.0  79.0  2842.0   
9282         9282  1989.0    Kevin Johnson  PG  22.0  PHO  81.0  81.0  3179.0   
9283         9283  1989.0   

## Or rebounds

In [175]:
print(ann_stats.loc[(ann_stats['Year'] >= 1960.0) & (ann_stats['TRB'] >= 1500)])

       Unnamed: 0    Year             Player Pos   Age   Tm     G    GS  \
1473         1473  1960.0  Wilt Chamberlain*   C  23.0  PHW  72.0   NaN   
1554         1554  1960.0      Bill Russell*   C  25.0  BOS  74.0   NaN   
1593         1593  1961.0  Wilt Chamberlain*   C  24.0  PHW  79.0   NaN   
1655         1655  1961.0        Bob Pettit*  PF  28.0  STL  76.0   NaN   
1668         1668  1961.0      Bill Russell*   C  26.0  BOS  78.0   NaN   
1687         1687  1962.0      Walt Bellamy*   C  22.0  CHP  79.0   NaN   
1706         1706  1962.0  Wilt Chamberlain*   C  25.0  PHW  80.0   NaN   
1784         1784  1962.0      Bill Russell*   C  27.0  BOS  76.0   NaN   
1827         1827  1963.0  Wilt Chamberlain*   C  26.0  SFW  80.0   NaN   
1913         1913  1963.0      Bill Russell*   C  28.0  BOS  78.0   NaN   
1962         1962  1964.0  Wilt Chamberlain*   C  27.0  SFW  80.0   NaN   
2049         2049  1964.0      Bill Russell*   C  29.0  BOS  78.0   NaN   
2099         2099  1965.0
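
The three filters above share one shape: a year cutoff combined with a minimum for a single stat. A small helper can capture that pattern. The function name `big_seasons` and the demo rows are assumptions made for illustration.

```python
import pandas as pd

def big_seasons(df, stat, minimum, since=1960):
    """Return rows from `since` onward where `stat` is at least `minimum`."""
    # Each comparison yields a boolean Series; & combines them element-wise.
    # The parentheses are required because & binds tighter than >=.
    mask = (df["Year"] >= since) & (df[stat] >= minimum)
    return df.loc[mask]

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Year": [1955.0, 1962.0, 1985.0, 1990.0],
    "Player": ["P1", "P2", "P3", "P4"],
    "PTS": [2600.0, 4029.0, 1400.0, 2753.0],
    "AST": [200.0, 192.0, 968.0, 519.0],
})

print(big_seasons(demo, "PTS", 2500))
print(big_seasons(demo, "AST", 900))
```

Rows with a missing stat drop out on their own, since a comparison against `NaN` evaluates to `False`.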

## Let's explore an individual player a bit deeper
## I've got Moses Malone...any others of choice?

In [176]:
type(ann_stats.loc[ann_stats['Player'].isin(['Moses Malone*'])])

pandas.core.frame.DataFrame

## Here we're extracting Moses Malone's records into a new DataFrame object

In [177]:
moses_malone = ann_stats.loc[ann_stats['Player'].isin(['Moses Malone*'])]

In [178]:
moses_malone

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
4846,4846,1977.0,Moses Malone*,C,21.0,TOT,82.0,,2506.0,19.8,...,0.693,437.0,635.0,1072.0,89.0,67.0,181.0,,275.0,1083.0
4847,4847,1977.0,Moses Malone*,C,21.0,BUF,2.0,,6.0,-0.3,...,,0.0,1.0,1.0,0.0,0.0,0.0,,1.0,0.0
4848,4848,1977.0,Moses Malone*,C,21.0,HOU,80.0,,2500.0,19.8,...,0.693,437.0,634.0,1071.0,89.0,67.0,181.0,,274.0,1083.0
5218,5218,1978.0,Moses Malone*,C,22.0,HOU,59.0,,2107.0,21.2,...,0.718,380.0,506.0,886.0,31.0,48.0,76.0,220.0,179.0,1144.0
5580,5580,1979.0,Moses Malone*,C,23.0,HOU,82.0,,3390.0,23.7,...,0.739,587.0,857.0,1444.0,147.0,79.0,119.0,326.0,223.0,2031.0
5942,5942,1980.0,Moses Malone*,C,24.0,HOU,82.0,,3140.0,24.1,...,0.719,573.0,617.0,1190.0,147.0,80.0,107.0,300.0,210.0,2119.0
6298,6298,1981.0,Moses Malone*,C,25.0,HOU,80.0,,3245.0,25.1,...,0.757,474.0,706.0,1180.0,141.0,83.0,150.0,308.0,223.0,2222.0
6672,6672,1982.0,Moses Malone*,C,26.0,HOU,81.0,81.0,3398.0,26.8,...,0.762,558.0,630.0,1188.0,142.0,76.0,125.0,294.0,208.0,2520.0
7045,7045,1983.0,Moses Malone*,C,27.0,PHI,78.0,78.0,2922.0,25.1,...,0.761,445.0,749.0,1194.0,101.0,89.0,157.0,264.0,206.0,1908.0
7403,7403,1984.0,Moses Malone*,C,28.0,PHI,71.0,71.0,2613.0,21.8,...,0.75,352.0,598.0,950.0,96.0,71.0,110.0,250.0,188.0,1609.0


## Note we have a "TOT" row: the combined total for a season split between teams
## If we keep that in there, we will double-count those stats
## Let's drop it...

In [187]:
# Note: moses_malone.loc['Tm' == "TOT"] does not work, because 'Tm' == "TOT"
# compares two plain strings and evaluates to False; compare the column instead
moses_malone.loc[moses_malone['Tm'].isin(['TOT'])]

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
4846,4846,1977.0,Moses Malone*,C,21.0,TOT,82.0,,2506.0,19.8,...,0.693,437.0,635.0,1072.0,89.0,67.0,181.0,,275.0,1083.0


In [180]:
moses_malone2 = moses_malone.drop(4846)

## Here's a snippet that drops rows using the indices returned from the search

In [189]:
moses3 = moses_malone.drop(moses_malone.loc[moses_malone['Tm'].isin(['TOT'])].index)

In [190]:
moses3

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
4847,4847,1977.0,Moses Malone*,C,21.0,BUF,2.0,,6.0,-0.3,...,,0.0,1.0,1.0,0.0,0.0,0.0,,1.0,0.0
4848,4848,1977.0,Moses Malone*,C,21.0,HOU,80.0,,2500.0,19.8,...,0.693,437.0,634.0,1071.0,89.0,67.0,181.0,,274.0,1083.0
5218,5218,1978.0,Moses Malone*,C,22.0,HOU,59.0,,2107.0,21.2,...,0.718,380.0,506.0,886.0,31.0,48.0,76.0,220.0,179.0,1144.0
5580,5580,1979.0,Moses Malone*,C,23.0,HOU,82.0,,3390.0,23.7,...,0.739,587.0,857.0,1444.0,147.0,79.0,119.0,326.0,223.0,2031.0
5942,5942,1980.0,Moses Malone*,C,24.0,HOU,82.0,,3140.0,24.1,...,0.719,573.0,617.0,1190.0,147.0,80.0,107.0,300.0,210.0,2119.0
6298,6298,1981.0,Moses Malone*,C,25.0,HOU,80.0,,3245.0,25.1,...,0.757,474.0,706.0,1180.0,141.0,83.0,150.0,308.0,223.0,2222.0
6672,6672,1982.0,Moses Malone*,C,26.0,HOU,81.0,81.0,3398.0,26.8,...,0.762,558.0,630.0,1188.0,142.0,76.0,125.0,294.0,208.0,2520.0
7045,7045,1983.0,Moses Malone*,C,27.0,PHI,78.0,78.0,2922.0,25.1,...,0.761,445.0,749.0,1194.0,101.0,89.0,157.0,264.0,206.0,1908.0
7403,7403,1984.0,Moses Malone*,C,28.0,PHI,71.0,71.0,2613.0,21.8,...,0.75,352.0,598.0,950.0,96.0,71.0,110.0,250.0,188.0,1609.0
7747,7747,1985.0,Moses Malone*,C,29.0,PHI,79.0,79.0,2957.0,22.5,...,0.815,385.0,646.0,1031.0,130.0,67.0,123.0,286.0,216.0,1941.0
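
An alternative to looking up the indices and calling `.drop()` is to keep only the rows whose team is not `"TOT"`. The sketch below rebuilds the three 1977 rows from the table above, trimmed to a few columns; the name `demo` is an assumption for illustration.

```python
import pandas as pd

# Three 1977 rows copied from the table above, trimmed to a few columns
demo = pd.DataFrame(
    {"Year": [1977.0, 1977.0, 1977.0],
     "Tm": ["TOT", "BUF", "HOU"],
     "G": [82.0, 2.0, 80.0],
     "PTS": [1083.0, 0.0, 1083.0]},
    index=[4846, 4847, 4848],
)

# Approach used above: find the matching index labels, then drop them
dropped = demo.drop(demo.loc[demo["Tm"].isin(["TOT"])].index)

# Alternative: keep every row that is not a "TOT" summary row
kept = demo.loc[demo["Tm"] != "TOT"]

print(kept)
print(dropped.equals(kept))
```

Both routes leave the original frame unchanged and return a new one, so the result has to be assigned to a name to be kept.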


In [181]:
moses_malone2

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
4847,4847,1977.0,Moses Malone*,C,21.0,BUF,2.0,,6.0,-0.3,...,,0.0,1.0,1.0,0.0,0.0,0.0,,1.0,0.0
4848,4848,1977.0,Moses Malone*,C,21.0,HOU,80.0,,2500.0,19.8,...,0.693,437.0,634.0,1071.0,89.0,67.0,181.0,,274.0,1083.0
5218,5218,1978.0,Moses Malone*,C,22.0,HOU,59.0,,2107.0,21.2,...,0.718,380.0,506.0,886.0,31.0,48.0,76.0,220.0,179.0,1144.0
5580,5580,1979.0,Moses Malone*,C,23.0,HOU,82.0,,3390.0,23.7,...,0.739,587.0,857.0,1444.0,147.0,79.0,119.0,326.0,223.0,2031.0
5942,5942,1980.0,Moses Malone*,C,24.0,HOU,82.0,,3140.0,24.1,...,0.719,573.0,617.0,1190.0,147.0,80.0,107.0,300.0,210.0,2119.0
6298,6298,1981.0,Moses Malone*,C,25.0,HOU,80.0,,3245.0,25.1,...,0.757,474.0,706.0,1180.0,141.0,83.0,150.0,308.0,223.0,2222.0
6672,6672,1982.0,Moses Malone*,C,26.0,HOU,81.0,81.0,3398.0,26.8,...,0.762,558.0,630.0,1188.0,142.0,76.0,125.0,294.0,208.0,2520.0
7045,7045,1983.0,Moses Malone*,C,27.0,PHI,78.0,78.0,2922.0,25.1,...,0.761,445.0,749.0,1194.0,101.0,89.0,157.0,264.0,206.0,1908.0
7403,7403,1984.0,Moses Malone*,C,28.0,PHI,71.0,71.0,2613.0,21.8,...,0.75,352.0,598.0,950.0,96.0,71.0,110.0,250.0,188.0,1609.0
7747,7747,1985.0,Moses Malone*,C,29.0,PHI,79.0,79.0,2957.0,22.5,...,0.815,385.0,646.0,1031.0,130.0,67.0,123.0,286.0,216.0,1941.0


In [120]:
moses_malone2.columns

Index(['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP',
       'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%',
       'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2',
       'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

## A simple example of a very useful Python pattern <br>

## Applying the same method to several columns

## Here we use: 
## `for x in [list of column names]: ... moses_malone2[x].sum()`
## The loop calls .sum() on each listed column in turn, giving one career total per stat

In [182]:
for x in ['G', 'MP', 'PTS', 'BLK', 'TRB', 'AST']:
    print(x, moses_malone2[x].sum())

G 1329.0
MP 45071.0
PTS 27409.0
BLK 1733.0
TRB 16212.0
AST 1796.0
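
The loop can also be written as a single call: selecting a list of columns returns a smaller DataFrame, and `.sum()` then totals every column at once. A sketch using two seasons copied from the table above (the name `demo` is an assumption for illustration):

```python
import pandas as pd

# The 1978 and 1979 seasons from the table above, trimmed to three columns
demo = pd.DataFrame({
    "G": [59.0, 82.0],
    "PTS": [1144.0, 2031.0],
    "TRB": [886.0, 1444.0],
})

cols = ["G", "PTS", "TRB"]

# Loop version: one .sum() call per column
for x in cols:
    print(x, demo[x].sum())

# Vectorized version: one .sum() call returns a Series of totals
totals = demo[cols].sum()
print(totals)
```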


In [183]:
# Use the totals printed above, which exclude the double-counted "TOT" row
print("Pts/Gm = ", round(27409 / 1329, 2))

Pts/Gm =  20.62


In [184]:
print("Rebounds/Gm = ", round(16212 / 1329, 2))

Rebounds/Gm =  12.2


In [185]:
print("Blocks/Gm = ", round(1733 / 1329, 2))

Blocks/Gm =  1.3
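
Typing totals in by hand invites mistakes, so one option is to compute the per-game figures from the frame itself. The sketch below uses the 1977 and 1978 rows from the Moses Malone table, trimmed to a few columns; `demo`, `per_team`, and `per_game` are names assumed for illustration.

```python
import pandas as pd

# 1977-1978 rows copied from the Moses Malone table, trimmed to a few columns
demo = pd.DataFrame({
    "Tm": ["TOT", "BUF", "HOU", "HOU"],
    "G": [82.0, 2.0, 80.0, 59.0],
    "PTS": [1083.0, 0.0, 1083.0, 1144.0],
    "TRB": [1072.0, 1.0, 1071.0, 886.0],
})

# Exclude the "TOT" summary row so the 1977 season is counted once
per_team = demo.loc[demo["Tm"] != "TOT"]

totals = per_team[["G", "PTS", "TRB"]].sum()

# Dividing the stat totals by total games gives per-game averages
per_game = (totals[["PTS", "TRB"]] / totals["G"]).round(2)
print(per_game)
```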


# STOP FRIDAY FEB 21