# COGS 108 - Final Project

## Important

- ONE, and only one, member of your group should upload this notebook to TritonED. 
- Each member of the group will receive the same grade on this assignment. 
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Only upload the .ipynb file to TED; do not upload any associated data. Make sure that any cells whose output you want graders to see have been executed.

## Group Members: Fill in the Student IDs of each group member here

Replace the lines below to list each person's full student ID, UCSD email, and full name.

-  A12380391, Eddy Ambing, <eambing@ucsd.edu>
-  A13514859, Trevor Mazza, <tmazza@ucsd.edu> 
-  A12916152, Franklin Li, <frl003@ucsd.edu>
-  A12122669, Zafrin Dhali, <zdhali@ucsd.edu>
-  A14036039, Albert Chiu, <alc204@ucsd.edu>
-  A11625738, Elias Solorzano, <e1solorz@ucsd.edu>



Start your project here.

## PROJECT OUTLINE FROM RUBRIC
### Remove before submission


- Introduction and Background
- Data Description
- Data Cleaning/Pre-processing
- Data Visualization
- Data Analysis and Results
- Privacy/Ethics Considerations
- Conclusions and Discussion


# Data Description

The data comes from Sean Lahman's baseball database, which arrives already cleaned and is split into separate CSV files by type of information (the files loaded below cover salaries, postseason batting, appearances, and player biographies). Where relevant, each record is tied to a player through a unique identifier (`playerID`), and the year is always specified.

# Data Cleaning/Pre-processing

Because these statistics span such a long period, some relationships in the data could be confounded by other variables, such as the perceived worth of playing baseball as a profession. The data set is further limited by which statistics are available, most notably salary data, which only goes back to 1985.
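
One way to apply this limit is to find the first season with salary records and restrict the other tables to that range. The following is a minimal sketch, assuming toy stand-ins for the salary and appearance tables: the rows are illustrative rather than real records, and `restrict_to_salary_era` is a hypothetical helper, not part of the dataset or of pandas.

```python
import pandas as pd

# Toy stand-ins for the salary and appearance tables; all values are made up.
salaries_demo = pd.DataFrame({
    "playerID": ["p1", "p2", "p1"],
    "yearID": [1985, 1985, 1986],
    "salary": [500000, 250000, 600000],
})
appearances_demo = pd.DataFrame({
    "playerID": ["p0", "p1", "p2", "p1"],
    "yearID": [1971, 1985, 1985, 1986],
    "G_all": [10, 120, 54, 130],
})


def restrict_to_salary_era(table, salary_table):
    """Keep only rows from seasons at or after the first season with salary data."""
    first_salary_year = salary_table["yearID"].min()
    return table[table["yearID"] >= first_salary_year]


appearances_recent = restrict_to_salary_era(appearances_demo, salaries_demo)
print(appearances_recent)
```

Deriving the cutoff from the salary table, rather than hard-coding a year, keeps the filter valid if a different release of the data is used.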

# Privacy/Ethics Considerations


The baseball database is copyright 1996-2018 by Sean Lahman. The data set is licensed under [CC-BY-SA 3.0 Unported](https://creativecommons.org/licenses/by-sa/3.0/). This license grants us permission to share and adapt the data in any form as long as we comply with its conditions, namely that we credit the source of the data and release any changes we make to the dataset under the same license [or another compatible license](https://creativecommons.org/share-your-work/licensing-considerations/ProjectProposal.ipynb).

There are no privacy concerns regarding this data set or our use of it. All of the data consists of historical records of professional players whose performance is public record, so there is no reasonable expectation of privacy regarding these statistics. The data set is unlikely to contain biases, other than the fact that the collection of statistics has become more sophisticated and reliable over time.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# scikit-learn imports
#   SVM (Support Vector Machine) classifier
#   Vectorizer, which transforms text data into bag-of-words features
#   TF-IDF Vectorizer, which down-weights words that are common across the dataset when transforming text data
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

# Feel free to add more if you think we need it

In [2]:
salaries = pd.read_csv('baseballdatabank-2019.2/core/Salaries.csv')
salaries = salaries[['playerID', 'salary', 'yearID', 'teamID', 'lgID']]
salaries
#print(salaries[salaries['playerID'] == 'barkele01'])

Unnamed: 0,playerID,salary,yearID,teamID,lgID
0,barkele01,870000,1985,ATL,NL
1,bedrost01,550000,1985,ATL,NL
2,benedbr01,545000,1985,ATL,NL
3,campri01,633333,1985,ATL,NL
4,ceronri01,625000,1985,ATL,NL
5,chambch01,800000,1985,ATL,NL
6,dedmoje01,150000,1985,ATL,NL
7,forstte01,483333,1985,ATL,NL
8,garbege01,772000,1985,ATL,NL
9,harpete01,250000,1985,ATL,NL


In [3]:
batting = pd.read_csv('baseballdatabank-2019.2/core/BattingPost.csv')
batting

Unnamed: 0,yearID,round,playerID,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,1884,WS,becanbu01,NY4,AA,1,2,0,1,0,...,0,0,,0,0,0,,,,
1,1884,WS,bradyst01,NY4,AA,3,10,1,0,0,...,0,0,,0,1,0,,,,
2,1884,WS,carrocl01,PRO,NL,3,10,2,1,0,...,1,0,,1,1,0,,,,
3,1884,WS,dennyje01,PRO,NL,3,9,3,4,0,...,2,0,,0,3,0,,,,
4,1884,WS,esterdu01,NY4,AA,3,10,0,3,1,...,0,1,,0,3,0,,,,
5,1884,WS,farreja02,PRO,NL,3,9,3,4,2,...,0,1,,0,0,0,,,,
6,1884,WS,forstto01,NY4,AA,1,3,0,0,0,...,0,0,,0,1,0,,,,
7,1884,WS,gilliba01,PRO,NL,3,9,3,4,2,...,2,0,,0,1,0,,,,
8,1884,WS,hinespa01,PRO,NL,3,8,5,2,0,...,1,2,,3,0,0,,,,
9,1884,WS,irwinar01,PRO,NL,3,9,3,2,0,...,2,0,,0,2,0,,,,


In [4]:
# Note: this chained assignment also rebinds `salaries` to the Appearances table
appearances = salaries = pd.read_csv('baseballdatabank-2019.2/core/Appearances.csv')
appearances

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,...,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
0,1871,TRO,,abercda01,1,1.0,1,1.0,0,0,...,0,0,1,0,0,0,0,0.0,0.0,0.0
1,1871,RC1,,addybo01,25,25.0,25,25.0,0,0,...,22,0,3,0,0,0,0,0.0,0.0,0.0
2,1871,CL1,,allisar01,29,29.0,29,29.0,0,0,...,2,0,0,0,29,0,29,0.0,0.0,0.0
3,1871,WS3,,allisdo01,27,27.0,27,27.0,0,27,...,0,0,0,0,0,0,0,0.0,0.0,0.0
4,1871,RC1,,ansonca01,25,25.0,25,25.0,0,5,...,2,20,0,1,0,0,1,0.0,0.0,0.0
5,1871,FW1,,armstbo01,12,12.0,12,12.0,0,0,...,0,0,0,0,11,1,12,0.0,0.0,0.0
6,1871,RC1,,barkeal01,1,1.0,1,1.0,0,0,...,0,0,0,1,0,0,1,0.0,0.0,0.0
7,1871,BS1,,barnero01,31,31.0,31,31.0,0,0,...,16,0,15,0,0,0,0,0.0,0.0,0.0
8,1871,FW1,,barrebi01,1,1.0,1,1.0,0,1,...,0,1,0,0,0,0,0,0.0,0.0,0.0
9,1871,BS1,,barrofr01,18,17.0,18,18.0,0,0,...,1,0,0,13,0,4,17,0.0,0.0,0.0
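
The chained assignment in the cell above binds both `appearances` and `salaries` to the same DataFrame, so the salary table loaded earlier is no longer reachable under that name. Here is a minimal sketch of that behavior, using small made-up frames (the `_demo` names are illustrative) instead of the real CSV files:

```python
import pandas as pd

# Made-up stand-in for the salary table loaded earlier.
salaries_demo = pd.DataFrame({"playerID": ["p1"], "salary": [500000]})

# Chained assignment: both names now refer to the same new object.
appearances_demo = salaries_demo = pd.DataFrame({"playerID": ["p1"], "G_all": [120]})

# Check whether the two names point at one object, and which columns remain.
print(appearances_demo is salaries_demo)
print(list(salaries_demo.columns))
```

Assigning each table on its own line keeps the two names independent.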


In [5]:
appearances = appearances[appearances.yearID >= 1985]
appearances

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,...,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
62204,1985,BAL,AL,aasedo01,54,0.0,0,54.0,54,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
62205,1985,CHN,NL,abregjo01,6,5.0,6,6.0,6,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
62206,1985,TOR,AL,ackerji01,61,0.0,0,61.0,61,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
62207,1985,SFN,NL,adamsri02,54,32.0,54,46.0,0,0,...,5,17,25,0,0,0,0,0.0,5.0,8.0
62208,1985,CHA,AL,agostju01,54,0.0,4,54.0,54,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
62209,1985,PHI,NL,aguaylu01,91,42.0,91,82.0,0,0,...,17,7,60,0,0,0,0,0.0,11.0,9.0
62210,1985,NYN,NL,aguilri01,22,19.0,22,21.0,21,0,...,0,0,0,0,0,0,0,0.0,0.0,1.0
62211,1985,TOR,AL,aikenwi01,12,6.0,12,0.0,0,0,...,0,0,0,0,0,0,0,11.0,6.0,0.0
62212,1985,TOR,AL,alexado01,36,36.0,0,36.0,36,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
62213,1985,TOR,AL,allenga01,14,11.0,14,14.0,0,14,...,0,0,0,0,0,0,0,0.0,0.0,0.0


In [6]:
people = pd.read_csv('baseballdatabank-2019.2/core/People.csv')


In [7]:
salaries.set_index("playerID", inplace=True)
people.set_index("playerID", inplace=True)

In [8]:
salaries

playerID,yearID,teamID,lgID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
abercda01,1871,TRO,,1,1.0,1,1.0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,0.0
addybo01,1871,RC1,,25,25.0,25,25.0,0,0,0,22,0,3,0,0,0,0,0.0,0.0,0.0
allisar01,1871,CL1,,29,29.0,29,29.0,0,0,0,2,0,0,0,29,0,29,0.0,0.0,0.0
allisdo01,1871,WS3,,27,27.0,27,27.0,0,27,0,0,0,0,0,0,0,0,0.0,0.0,0.0
ansonca01,1871,RC1,,25,25.0,25,25.0,0,5,1,2,20,0,1,0,0,1,0.0,0.0,0.0
armstbo01,1871,FW1,,12,12.0,12,12.0,0,0,0,0,0,0,0,11,1,12,0.0,0.0,0.0
barkeal01,1871,RC1,,1,1.0,1,1.0,0,0,0,0,0,0,1,0,0,1,0.0,0.0,0.0
barnero01,1871,BS1,,31,31.0,31,31.0,0,0,0,16,0,15,0,0,0,0,0.0,0.0,0.0
barrebi01,1871,FW1,,1,1.0,1,1.0,0,1,0,0,1,0,0,0,0,0,0.0,0.0,0.0
barrofr01,1871,BS1,,18,17.0,18,18.0,0,0,0,1,0,0,13,0,4,17,0.0,0.0,0.0


In [10]:
# Left join on the shared playerID index (the join_axes argument was removed from pd.concat in pandas 1.0)
result = salaries.join(people, how='left')

In [11]:
result

playerID,yearID,teamID,lgID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
abercda01,1871,TRO,,1,1.0,1,1.0,0,0,0,...,Abercrombie,Francis Patterson,,,,,1871-10-21,1871-10-21,aberd101,abercda01
addybo01,1871,RC1,,25,25.0,25,25.0,0,0,0,...,Addy,Robert Edward,160.0,68.0,L,L,1871-05-06,1877-10-06,addyb101,addybo01
allisar01,1871,CL1,,29,29.0,29,29.0,0,0,0,...,Allison,Arthur Algernon,150.0,68.0,,,1871-05-04,1876-10-05,allia101,allisar01
allisdo01,1871,WS3,,27,27.0,27,27.0,0,27,0,...,Allison,Douglas L.,160.0,70.0,R,R,1871-05-05,1883-07-13,allid101,allisdo01
ansonca01,1871,RC1,,25,25.0,25,25.0,0,5,1,...,Anson,Adrian Constantine,227.0,72.0,R,R,1871-05-06,1897-10-03,ansoc101,ansonca01
armstbo01,1871,FW1,,12,12.0,12,12.0,0,0,0,...,Armstrong,Robert Livingston,160.0,74.0,,,1871-06-26,1871-08-29,armsr101,armstsa01
barkeal01,1871,RC1,,1,1.0,1,1.0,0,0,0,...,Barker,Alfred L.,162.0,72.0,,,1871-06-01,1871-06-01,barka101,barkeal01
barnero01,1871,BS1,,31,31.0,31,31.0,0,0,0,...,Barnes,Charles Roscoe,145.0,68.0,R,R,1871-05-05,1881-09-21,barnr102,barnero01
barrebi01,1871,FW1,,1,1.0,1,1.0,0,1,0,...,Barrett,William,,,,,1871-07-08,1873-10-18,barrb102,barrebi01
barrofr01,1871,BS1,,18,17.0,18,18.0,0,0,0,...,Barrows,Franklin Lee,,,,,1871-05-20,1871-10-07,barrf102,barrofr01
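
The combination step above attaches one row of biographical data to every season row that shares a `playerID`. Below is a minimal sketch of that kind of index-aligned left join, assuming made-up rows and using `DataFrame.join`. The frame names and values are illustrative only.

```python
import pandas as pd

# Season-level rows: a player can appear once per season, so the index repeats.
seasons_demo = pd.DataFrame(
    {"yearID": [1985, 1986, 1985], "G_all": [120, 130, 54]},
    index=pd.Index(["p1", "p1", "p2"], name="playerID"),
)

# One biographical row per player, so this index is unique.
people_demo = pd.DataFrame(
    {"nameLast": ["Alpha", "Beta", "Gamma"], "height": [72.0, 70.0, 68.0]},
    index=pd.Index(["p1", "p2", "p3"], name="playerID"),
)

# Left join on the index: keeps every season row and copies the
# matching biographical columns onto each one.
combined_demo = seasons_demo.join(people_demo, how="left")
print(combined_demo)
```

Players who appear only in the biographical table (here `p3`) are dropped by the left join, while players with several seasons get their biographical columns repeated on each season row.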
