In [4]:
import pandas as pd

**Reading the dataset**

In [5]:
dt_baseball = pd.read_csv('/content/mlbbat10.csv')

*Checking the size of the table: it has 1199 rows (players) and 19 columns.*

In [None]:
dt_baseball.shape

(1199, 19)

Let's check whether any player names are missing, since missing data would distort the statistics. An empty result means every row has a name.

In [6]:
dt_baseball[pd.isnull(dt_baseball.name)]

Unnamed: 0,name,team,position,game,at_bat,run,hit,double,triple,home_run,rbi,total_base,walk,strike_out,stolen_base,caught_stealing,obp,slg,bat_avg
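The check above only looks at the `name` column. A minimal sketch of how the same idea could be extended to every column at once, assuming a small hypothetical sample (the missing values are invented for illustration and are not in the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the gaps are made up for illustration.
sample = pd.DataFrame({
    "name": ["I Suzuki", "D Jeter", None],
    "team": ["SEA", "NYY", "TEX"],
    "hit": [214, 179, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True if at least one cell anywhere in the table is missing.
has_missing = bool(sample.isnull().values.any())
print(has_missing)
```

On the real table, `dt_baseball.isnull().sum()` would give the same per-column summary.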


*First, let's see which columns the dataset has:*

In [None]:
dt_baseball.columns

Index(['name', 'team', 'position', 'game', 'at_bat', 'run', 'hit', 'double',
       'triple', 'home_run', 'rbi', 'total_base', 'walk', 'strike_out',
       'stolen_base', 'caught_stealing', 'obp', 'slg', 'bat_avg'],
      dtype='object')

*If you don't know baseball, that's fine. I'll explain what each column means.*



*   **name** : the name of the player
*   **team** : the player's team
*   **position** : the player's position
*   **game** : how many games the player appeared in
*   **at_bat** : the number of official at-bats, that is, turns batting against a pitcher that do not end in a walk, hit by pitch, or sacrifice
*   **run** : how many times the player rounded the bases and reached home plate safely, scoring a run
*   **hit** : how many times the player reached base safely after hitting the ball
*   **double** : how many times the player hit the pitched ball and safely reached second base
*   **triple** : how many times the player hit the ball and reached third base, without the help of an intervening error or an attempt to put out another baserunner
*   **home_run** : how many times the player hit the ball and was able to circle the bases and reach home plate safely
*   **rbi** : runs batted in, the number of runs that scored as a result of the player's batting
*   **total_base** : how many bases the player gained with hits (1 for a single, 2 for a double, 3 for a triple, 4 for a home run)
*   **walk** : how many times the batter received four balls (pitches outside the strike zone) during a plate appearance and was awarded first base
*   **strike_out** : how many times the batter received three strikes during a time at bat
*   **stolen_base** : how many times the player, as a baserunner, advanced to the next base without the help of a hit, walk, or error
*   **caught_stealing** : how many times the player was caught stealing, which is when the runner attempts to advance from one base to another without the ball being batted and is tagged out by a fielder
*   **obp** : on-base percentage, how frequently the batter reaches base per plate appearance
*   **slg** : slugging percentage, the total number of bases the player records per at-bat (`total_base` divided by `at_bat`)
*   **bat_avg** : batting average, hits divided by at-bats, a basic measure of batting performance
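The rate columns follow directly from the counting columns. As a rough sketch, the snippet below rebuilds `total_base`, `bat_avg`, and `slg` for the first two players of the dataset (I Suzuki and D Jeter); the `sample` frame and the intermediate variable names are illustrative, not part of the notebook.

```python
import pandas as pd

# Counting stats for the first two rows of the dataset.
sample = pd.DataFrame({
    "name": ["I Suzuki", "D Jeter"],
    "at_bat": [680, 663],
    "hit": [214, 179],
    "double": [30, 30],
    "triple": [3, 3],
    "home_run": [6, 10],
})

# Singles are the hits that were not doubles, triples, or home runs.
singles = sample["hit"] - sample["double"] - sample["triple"] - sample["home_run"]

# Each hit type is worth 1, 2, 3, or 4 bases.
total_base = (singles + 2 * sample["double"]
              + 3 * sample["triple"] + 4 * sample["home_run"])

bat_avg = (sample["hit"] / sample["at_bat"]).round(3)   # hits per at-bat
slg = (total_base / sample["at_bat"]).round(3)          # bases per at-bat

print(total_base.tolist())  # [268, 245]
print(bat_avg.tolist())     # [0.315, 0.27]
print(slg.tolist())         # [0.394, 0.37]
```

These values agree with the `total_base`, `bat_avg`, and `slg` columns stored for the same two players.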





What types of data does each column have?

In [None]:
dt_baseball.dtypes

name                object
team                object
position            object
game                 int64
at_bat               int64
run                  int64
hit                  int64
double               int64
triple               int64
home_run             int64
rbi                  int64
total_base           int64
walk                 int64
strike_out           int64
stolen_base          int64
caught_stealing      int64
obp                float64
slg                float64
bat_avg            float64
dtype: object

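Integer columns can be converted to float for analysis. Here the `game` column is cast as an example. Note that `astype` returns a new Series, so the table itself only changes if the result is assigned back.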

In [None]:
dt_baseball.game.astype('float64')

0       162.0
1       157.0
2       157.0
3       160.0
4       160.0
        ...  
1194      3.0
1195      4.0
1196      7.0
1197      6.0
1198      4.0
Name: game, Length: 1199, dtype: float64
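Because `astype` does not modify the table in place, the conversion has to be assigned back to persist. A minimal sketch, assuming a small sample with the same column names, of converting every `int64` column at once (the use of `select_dtypes` is a suggestion, not something the notebook does):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["I Suzuki", "D Jeter"],
    "game": [162, 157],
    "at_bat": [680, 663],
    "obp": [0.359, 0.340],
})

# astype returns a new Series; the column inside the table is untouched.
converted = sample["game"].astype("float64")
dtype_before = sample["game"].dtype
print(dtype_before)

# Pick every int64 column and assign the converted values back.
int_columns = sample.select_dtypes(include="int64").columns
sample[int_columns] = sample[int_columns].astype("float64")
print(sample.dtypes)
```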

Checking if there are players who do not have a specific position.

In [None]:
dt_baseball.loc[dt_baseball['position'].isnull()]

Unnamed: 0,name,team,position,game,at_bat,run,hit,double,triple,home_run,rbi,total_base,walk,strike_out,stolen_base,caught_stealing,obp,slg,bat_avg


*The first 5 rows of the dataset are:*

In [None]:
dt_baseball.head()

Unnamed: 0,name,team,position,game,at_bat,run,hit,double,triple,home_run,rbi,total_base,walk,strike_out,stolen_base,caught_stealing,obp,slg,bat_avg
0,I Suzuki,SEA,OF,162,680,74,214,30,3,6,43,268,45,86,42,9,0.359,0.394,0.315
1,D Jeter,NYY,SS,157,663,111,179,30,3,10,67,245,63,106,18,5,0.34,0.37,0.27
2,M Young,TEX,3B,157,656,99,186,36,3,21,91,291,50,115,4,2,0.33,0.444,0.284
3,J Pierre,CWS,OF,160,651,96,179,18,3,1,47,206,45,47,68,18,0.341,0.316,0.275
4,R Weeks,MIL,2B,160,651,112,175,32,4,29,83,302,76,184,11,4,0.366,0.464,0.269


*I'm a New York Yankees fan, so I'm going to show you all the players from that team*



In [None]:
dt_baseball[dt_baseball['team']=='NYY']['name']

1             D Jeter
8              R Cano
20         M Teixeira
58          N Swisher
92        A Rodriguez
122         B Gardner
131      C Granderson
167          A Kearns
182          J Posada
268        F Cervelli
315          M Thames
374            R Pena
427         L Berkman
472         N Johnson
488         J Miranda
506            R Winn
510          C Curtis
552           E Nunez
555           K Russo
666          G Golson
690         C Huffman
709         C Moeller
792        C Sabathia
811        A Pettitte
903         A Burnett
930          P Hughes
953          M Rivera
967         J Vazquez
971          A Aceves
1010    J Chamberlain
1042         C Gaudin
1079          B Logan
1090          D Marte
1101          S Mitre
1109           I Nova
1120           C Park
1145      D Robertson
Name: name, dtype: object

The first player of the dataset is:

In [None]:
dt_baseball.iloc[0]

name               I Suzuki
team                    SEA
position                 OF
game                    162
at_bat                  680
run                      74
hit                     214
double                   30
triple                    3
home_run                  6
rbi                      43
total_base              268
walk                     45
strike_out               86
stolen_base              42
caught_stealing           9
obp                   0.359
slg                   0.394
bat_avg               0.315
Name: 0, dtype: object

To see which team the player at index 10 plays for:

In [None]:
dt_baseball.loc[10:10, 'team']

10    MIL
Name: team, dtype: object
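The slice `10:10` returns a one-element Series. As a small sketch on a hypothetical three-row sample, this is how a slice, a single label, and a list of columns behave differently with `.loc`:

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["I Suzuki", "D Jeter", "M Young"],
    "team": ["SEA", "NYY", "TEX"],
})

as_series = sample.loc[1:1, "team"]      # label slice -> one-element Series
as_scalar = sample.loc[1, "team"]        # single label -> plain value
row = sample.loc[1, ["name", "team"]]    # single label, several columns -> Series

print(as_series.tolist())  # ['NYY']
print(as_scalar)           # NYY
print(row["name"])         # D Jeter
```

On the real table, `dt_baseball.loc[10, ['name', 'team']]` would show the player's name next to the team.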

Showing only each player's name, team, and position.

In [None]:
dt_baseball.loc[0:, ['name', 'team', 'position']]

Unnamed: 0,name,team,position
0,I Suzuki,SEA,OF
1,D Jeter,NYY,SS
2,M Young,TEX,3B
3,J Pierre,CWS,OF
4,R Weeks,MIL,2B
...,...,...,...
1194,B Wood,KC,P
1195,M Wuertz,OAK,P
1196,M Zagurski,PHI,P
1197,B Ziegler,OAK,P


Hits are among the most important plays in baseball, so let's see which players recorded the most.

In [None]:
dt_baseball.sort_values(by="hit", ascending=False)
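A caution on ranking by a column: passing an integer column to `.loc` does not filter or rank anything. It looks up rows whose index labels equal the column's values, which produces repeated rows. A minimal sketch with hypothetical players, contrasting that lookup with `sort_values` and `nlargest`:

```python
import pandas as pd

# Hypothetical players, for illustration only.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "hit": [3, 0, 0, 1],
})

# Label lookup: fetches the rows labelled 3, 0, 0, 1 (note the repeats).
by_label = sample.loc[sample["hit"]]
print(by_label["name"].tolist())  # ['D', 'A', 'A', 'B']

# Ranking every row by the column.
ranked = sample.sort_values(by="hit", ascending=False)

# Only the top two rows.
top_two = sample.nlargest(2, "hit")
print(top_two["name"].tolist())   # ['A', 'D']
```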


The home run is another key play, so let's look at the summary statistics for the `home_run` column.

In [None]:
dt_baseball.home_run.describe()

count    1199.000000
mean        3.847373
std         7.372345
min         0.000000
25%         0.000000
50%         0.000000
75%         4.000000
max        54.000000
Name: home_run, dtype: float64

Which positions hit the most home runs?

In [None]:
dt_baseball.groupby('position').home_run.sum().sort_values(ascending=False)

What was the average number of games that the league's players played?

In [None]:
dt_baseball.game.mean()

50.534612176814015

Which teams have the most players? And which teams have the fewest players in this database?

In [None]:
dt_baseball.team.value_counts()

FLA    53
NYM    47
PIT    46
WSH    45
ARI    45
COL    44
HOU    44
LAD    44
CIN    43
ATL    42
BOS    42
STL    41
SD     41
SF     40
SEA    40
MIL    40
PHI    39
CHC    39
LAA    38
NYY    37
TEX    37
DET    36
MIN    36
BAL    36
OAK    36
CLE    35
TB     34
TOR    34
KC     33
CWS    32
Name: team, dtype: int64

Strikeouts reflect a pitcher's dominance over a batter. Because this is a batting dataset, `strike_out` counts how often a player struck out at the plate, so the cell below lists the pitchers who struck out most often as batters.

In [None]:
dt_baseball.loc[dt_baseball["position"]=="P"][["name", "strike_out"]].sort_values(by="strike_out", ascending=False)

Unnamed: 0,name,strike_out
449,R Halladay,42
487,R Lopez,35
483,M Cain,33
525,P Maholm,31
479,B Arroyo,31
...,...,...
1018,C Daigle,0
1019,M Daley,0
1020,S Deduno,0
1021,E Del Rosario,0


Another very popular team is the Los Angeles Dodgers. Let's see their players.

In [None]:
dt_baseball.loc[dt_baseball.team == 'LAD', 'name']

19             M Kemp
33            J Loney
38          R Theriot
98           A Ethier
104           C Blake
181          R Furcal
202         J Carroll
218          R Martin
234         R Barajas
323         R Johnson
330         M Ramirez
362        R Belliard
373        G Anderson
380       S Podsednik
395          J Castro
406            X Paul
423           A Ellis
469         J Gibbons
490          B Ausmus
495     C Billingsley
529         C Kershaw
530          H Kuroda
538           T Lilly
574        R Mitchell
637             J Ely
648         V Padilla
667              C Hu
668         T Oeltjen
710     C Monasterios
721         J Lindsey
771          C Haeger
794          J Weaver
886        R Troncoso
932          K Jansen
934             H Kuo
959     T Schlichting
984       R Belisario
998         J Broxton
1028         S Elbert
1078           J Link
1100         J Miller
1118          R Ortiz
1160       G Sherrill
1179       J Taschner
Name: name, dtype: object
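Both team listings in this notebook boil down to a boolean mask on the `team` column. A short sketch, on a hypothetical three-player sample, of filtering one team with `==` and several teams at once with `isin`:

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["D Jeter", "M Kemp", "I Suzuki"],
    "team": ["NYY", "LAD", "SEA"],
})

# One team: compare the column directly to get a boolean mask.
dodgers = sample.loc[sample["team"] == "LAD", "name"]
print(dodgers.tolist())  # ['M Kemp']

# Several teams: isin builds the mask from a list of values.
two_teams = sample.loc[sample["team"].isin(["NYY", "LAD"]), ["name", "team"]]
print(two_teams["name"].tolist())  # ['D Jeter', 'M Kemp']
```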

The stolen base is an important offensive play. Let's check which players have the most stolen bases.

In [None]:
dt_baseball[["name", "stolen_base"]].sort_values(by="stolen_base", ascending=False)


What positions had the fewest home runs?

In [None]:
dt_baseball.groupby('position').home_run.sum().sort_values()
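When aggregating home runs by group, the choice of aggregation matters: `count()` returns the number of rows (players) in each group, while `sum()` adds up the home runs themselves. A minimal sketch on a hypothetical four-player sample:

```python
import pandas as pd

# Hypothetical sample; only the column names match the dataset.
sample = pd.DataFrame({
    "position": ["OF", "OF", "SS", "P"],
    "home_run": [6, 1, 10, 0],
})

grouped = sample.groupby("position")["home_run"]

players_per_position = grouped.count()              # rows per group
home_runs_per_position = grouped.sum().sort_values()  # total home runs per group

print(players_per_position.to_dict())    # {'OF': 2, 'P': 1, 'SS': 1}
print(home_runs_per_position.to_dict())  # {'P': 0, 'OF': 7, 'SS': 10}
```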

How many home runs did each team hit?

In [None]:
dt_baseball.groupby('team').home_run.sum()

Which player had the most home runs?



In [None]:
dt_baseball.loc[(dt_baseball.home_run.idxmax()), 'name']

'J Bautista'

Who had the most home runs on each team?

In [None]:
dt_baseball.loc[dt_baseball.groupby('team').home_run.idxmax(), ['name', 'home_run', 'team']].sort_values(by='home_run')

Unnamed: 0,name,home_run,team
71,K Kouzmanoff,16,OAK
68,Y Betancourt,16,KC
120,B McCann,21,ATL
27,G Jones,21,PIT
73,S Choo,22,CLE
185,R Branyan,25,SEA
133,A Ramirez,25,CHC
262,J Thome,25,MIN
13,H Pence,25,HOU
138,M Napoli,26,LAA


I want to know which player on each team had the most strikeouts.

In [None]:
dt_baseball.groupby(['team']).apply(lambda player: player.loc[player.strike_out.idxmax()])

team,name,team,position,game,at_bat,run,hit,double,triple,home_run,rbi,total_base,walk,strike_out,stolen_base,caught_stealing,obp,slg,bat_avg
ARI,M Reynolds,ARI,3B,145,499,79,99,17,2,32,85,216,83,211,7,4,0.32,0.433,0.198
ATL,D Lee,ATL,1B,148,547,80,142,35,0,19,80,234,73,134,1,3,0.347,0.428,0.26
BAL,A Jones,BAL,OF,149,581,76,165,25,5,19,69,257,23,119,7,7,0.325,0.442,0.284
BOS,D Ortiz,BOS,DH,145,518,86,140,36,1,32,102,274,82,145,0,1,0.37,0.529,0.27
CHC,A Soriano,CHC,OF,147,496,67,128,40,3,24,79,246,45,123,5,1,0.322,0.496,0.258
CIN,D Stubbs,CIN,OF,150,514,91,131,19,6,22,77,228,55,168,30,6,0.329,0.444,0.255
CLE,S Choo,CLE,OF,144,550,81,165,31,2,22,90,266,83,118,22,7,0.401,0.484,0.3
COL,C Gonzalez,COL,OF,145,587,111,197,34,9,34,117,351,40,135,26,8,0.376,0.598,0.336
CWS,P Konerko,CWS,1B,149,548,89,171,30,1,39,111,320,72,110,0,1,0.393,0.584,0.312
DET,A Jackson,DET,OF,151,618,103,181,34,10,4,41,247,47,170,27,6,0.345,0.4,0.293
