# Reading and Manipulating Data with Pandas
## Employee Data & Major League Baseball Data
---
### [Play with the data in Google Colab.](https://colab.research.google.com/github/valeriemagalong/Val-Practices-Python/blob/main/Employee_MLB_Pandas/Pandas_Practice_Employee_MLB_Data.ipynb)

---
### Part One - Employee Data
---

1. Import pandas as `pd`.

In [1]:
import pandas as pd

2. Create three lists with five elements each. The first list should contain the employees' first names, the second their last names, and the third their ages. You can make up this data.

In [2]:
first_names = ['Nicki', 'Taylor', 'Idris', 'Jackie', 'Kim']
last_names = ['Minaj', 'Swift', 'Elba', 'Chan', 'Kardashian']
ages = [40, 33, 50, 69, 42]

3. Create a data dictionary from the three lists.

In [3]:
employees = {
    'first_names': first_names,
    'last_names': last_names,
    'ages': ages
}
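A dictionary that maps column names to equal-length lists is also the shape `pd.DataFrame` accepts directly. The following is a minimal sketch rather than part of the original exercise; the name `employees_df` is an assumption introduced only for illustration.

```python
import pandas as pd

first_names = ['Nicki', 'Taylor', 'Idris', 'Jackie', 'Kim']
last_names = ['Minaj', 'Swift', 'Elba', 'Chan', 'Kardashian']
ages = [40, 33, 50, 69, 42]

employees = {
    'first_names': first_names,
    'last_names': last_names,
    'ages': ages
}

# Each dictionary key becomes a column; each list supplies that column's values.
# employees_df is a hypothetical name, not used elsewhere in this notebook.
employees_df = pd.DataFrame(employees)

print(employees_df.shape)          # (5, 3)
print(list(employees_df.columns))  # ['first_names', 'last_names', 'ages']
```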

4. Create a pandas Series from the employee age list.

In [4]:
# Make a list of the employees' full names
full_names = [f"{first} {last}" for first, last in zip(first_names, last_names)]

# full_names will be the index labels of ages_series
ages_series = pd.Series(ages, index = full_names, name = 'Employee Ages')

ages_series

Nicki Minaj       40
Taylor Swift      33
Idris Elba        50
Jackie Chan       69
Kim Kardashian    42
Name: Employee Ages, dtype: int64
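Because the Series is indexed by full name, values can be looked up by label instead of by position. The sketch below rebuilds the same Series so that it runs on its own; `idris_age` and `over_45` are illustrative names, not part of the original exercise.

```python
import pandas as pd

ages_series = pd.Series(
    [40, 33, 50, 69, 42],
    index=['Nicki Minaj', 'Taylor Swift', 'Idris Elba', 'Jackie Chan', 'Kim Kardashian'],
    name='Employee Ages',
)

# Label-based lookup using a full name from the index
idris_age = ages_series.loc['Idris Elba']
print(idris_age)  # 50

# A boolean filter keeps the matching labels alongside the values
over_45 = ages_series[ages_series > 45]
print(over_45.index.tolist())  # ['Idris Elba', 'Jackie Chan']
```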

5. Determine the average employee age.

In [5]:
average_employee_age = ages_series.mean()

print(f"The average employee age is: {average_employee_age}")

The average employee age is: 46.8


6. Write the Series to a CSV file.

In [6]:
ages_series.to_csv("employee_ages.csv")
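To see what `to_csv` writes, and how the data might be read back, here is a hedged round-trip sketch. It keeps the CSV text in memory (calling `to_csv()` without a path returns a string) instead of touching `employee_ages.csv`; `csv_text`, `restored_df`, and `restored` are illustrative names.

```python
import io
import pandas as pd

ages_series = pd.Series(
    [40, 33, 50, 69, 42],
    index=['Nicki Minaj', 'Taylor Swift', 'Idris Elba', 'Jackie Chan', 'Kim Kardashian'],
    name='Employee Ages',
)

# With no path argument, to_csv() returns the CSV text instead of writing a file
csv_text = ages_series.to_csv()
print(csv_text.splitlines()[0])  # ,Employee Ages

# The index was written as the first column, so read it back as the index
restored_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# read_csv returns a DataFrame; select the single column to get a Series again
restored = restored_df['Employee Ages']
print(restored.loc['Jackie Chan'])  # 69
```

The header line starts with a comma because the index has no name, so its column header is empty.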

---
### Part Two - MLB Data
---

1. Import `21st_century_MLB_Batting.csv` into Python as a data frame.

In [7]:
mlb_df = pd.read_csv("https://raw.githubusercontent.com/valeriemagalong/Val-Practices-Python/main/Employee_MLB_Pandas/21st_century_MLB_Batting.csv")

mlb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28163 entries, 0 to 28162
Data columns (total 17 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   playerID  28163 non-null  object
 1   yearID    28163 non-null  int64 
 2   stint     28163 non-null  int64 
 3   teamID    28163 non-null  object
 4   lgID      28163 non-null  object
 5   G         28163 non-null  int64 
 6   AB        28163 non-null  int64 
 7   R         28163 non-null  int64 
 8   H         28163 non-null  int64 
 9   2B        28163 non-null  int64 
 10  3B        28163 non-null  int64 
 11  HR        28163 non-null  int64 
 12  RBI       28163 non-null  int64 
 13  SB        28163 non-null  int64 
 14  CS        28163 non-null  int64 
 15  BB        28163 non-null  int64 
 16  SO        28163 non-null  int64 
dtypes: int64(14), object(3)
memory usage: 3.7+ MB


2. Print the first five rows of the data frame.

In [8]:
mlb_df.head(5)

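|  | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | abbotje01 | 2000 | 1 | CHA | AL | 80 | 215 | 31 | 59 | 15 | 1 | 3 | 29 | 2 | 1 | 21 | 38 |
| 1 | abbotku01 | 2000 | 1 | NYN | NL | 79 | 157 | 22 | 34 | 7 | 1 | 6 | 12 | 1 | 1 | 14 | 51 |
| 2 | abbotpa01 | 2000 | 1 | SEA | AL | 35 | 5 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | abreubo01 | 2000 | 1 | PHI | NL | 154 | 576 | 103 | 182 | 42 | 10 | 25 | 79 | 28 | 8 | 100 | 116 |
| 4 | aceveju01 | 2000 | 1 | MIL | NL | 62 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |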


3. Print the last six rows of the data frame.

In [9]:
mlb_df.tail(6)

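|  | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 28157 | zimmebr01 | 2019 | 1 | CLE | AL | 9 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 |
| 28158 | zimmejo02 | 2019 | 1 | DET | AL | 23 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 28159 | zimmeky01 | 2019 | 1 | KCA | AL | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28160 | zimmery01 | 2019 | 1 | WAS | NL | 52 | 171 | 20 | 44 | 9 | 0 | 6 | 27 | 0 | 0 | 17 | 39 |
| 28161 | zobribe01 | 2019 | 1 | CHN | NL | 47 | 150 | 24 | 39 | 5 | 0 | 1 | 17 | 0 | 0 | 23 | 24 |
| 28162 | zuninmi01 | 2019 | 1 | TBA | AL | 90 | 266 | 30 | 44 | 10 | 1 | 9 | 32 | 0 | 0 | 20 | 98 |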
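`head(n)` and `tail(n)` select rows by position, so they match positional slices with `iloc`. The sketch below uses a small stand-in frame instead of the MLB data; `demo_df` and its `value` column are hypothetical.

```python
import pandas as pd

# A ten-row stand-in frame; demo_df is a hypothetical name for illustration
demo_df = pd.DataFrame({'value': range(10)})

first_five = demo_df.head(5)   # same rows as demo_df.iloc[:5]
last_six = demo_df.tail(6)     # same rows as demo_df.iloc[-6:]

print(first_five.index.tolist())  # [0, 1, 2, 3, 4]
print(last_six.index.tolist())    # [4, 5, 6, 7, 8, 9]
```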


4. Determine the average of the `HR` column.


In [10]:
average_HR = mlb_df['HR'].mean()

print(f"The average of the HR column is: {average_HR}")

The average of the HR column is: 3.695593509214217


5. Calculate the max of the `SO` column.

In [11]:
max_SO = mlb_df['SO'].max()

print(f"The max of the SO column is: {max_SO}")

The max of the SO column is: 223
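`max()` returns only the value. To see which row holds it, one option is `idxmax()`, which returns the index label of the first maximum. The sketch below uses three rows copied from the `tail(6)` output instead of the full dataset, so its result describes that sample only; `sample_df`, `max_so_label`, and `max_so_row` are illustrative names.

```python
import pandas as pd

# Three rows taken from the tail(6) output; this is not the full dataset
sample_df = pd.DataFrame({
    'playerID': ['zimmery01', 'zobribe01', 'zuninmi01'],
    'yearID': [2019, 2019, 2019],
    'SO': [39, 24, 98],
})

# idxmax() gives the index label of the largest SO value in the sample
max_so_label = sample_df['SO'].idxmax()

# Use that label to pull the whole row
max_so_row = sample_df.loc[max_so_label]
print(max_so_row['playerID'], max_so_row['SO'])  # zuninmi01 98
```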


6. Save the data frame as an Excel file.

In [12]:
# Writing .xlsx requires an Excel engine such as openpyxl to be installed
mlb_df.to_excel("21st_century_MLB_Batting.xlsx")

7. Create a subset of the data frame for players with more than 40 HR.

In [13]:
mlb_40_hr_df = mlb_df[mlb_df['HR'] > 40]

mlb_40_hr_df

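|  | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 55 | bagweje01 | 2000 | 1 | HOU | NL | 159 | 590 | 152 | 183 | 37 | 1 | 47 | 132 | 9 | 6 | 107 | 116 |
| 72 | batisto01 | 2000 | 1 | TOR | AL | 154 | 620 | 96 | 163 | 32 | 2 | 41 | 114 | 5 | 4 | 35 | 121 |
| 127 | bondsba01 | 2000 | 1 | SFN | NL | 143 | 480 | 129 | 147 | 28 | 4 | 49 | 106 | 11 | 3 | 117 | 77 |
| 300 | delgaca01 | 2000 | 1 | TOR | AL | 162 | 569 | 115 | 196 | 57 | 1 | 41 | 137 | 0 | 1 | 123 | 104 |
| 341 | edmonji01 | 2000 | 1 | SLN | NL | 152 | 525 | 129 | 155 | 25 | 0 | 42 | 108 | 10 | 3 | 103 | 167 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26911 | cruzne02 | 2019 | 1 | MIN | AL | 120 | 454 | 81 | 141 | 26 | 0 | 41 | 108 | 0 | 1 | 56 | 131 |
| 27917 | solerjo01 | 2019 | 1 | KCA | AL | 162 | 589 | 95 | 156 | 33 | 1 | 48 | 117 | 3 | 1 | 73 | 178 |
| 27965 | suareeu01 | 2019 | 1 | CIN | NL | 159 | 575 | 87 | 156 | 22 | 2 | 49 | 103 | 3 | 2 | 70 | 189 |
| 28021 | troutmi01 | 2019 | 1 | LAA | AL | 134 | 470 | 110 | 137 | 27 | 2 | 45 | 104 | 11 | 2 | 110 | 120 |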
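The filter works in two stages: the comparison `mlb_df['HR'] > 40` produces a boolean Series, and indexing the frame with that Series keeps only the rows marked `True`. The sketch below shows both stages on four rows copied from the outputs in this notebook; `sample_df`, `mask`, `subset`, and `same_subset` are illustrative names.

```python
import pandas as pd

# Four rows (playerID and HR only) copied from the outputs in this notebook
sample_df = pd.DataFrame({
    'playerID': ['abbotje01', 'abreubo01', 'bagweje01', 'batisto01'],
    'HR': [3, 25, 47, 41],
})

# Stage 1: the comparison yields one True/False value per row
mask = sample_df['HR'] > 40
print(mask.tolist())  # [False, False, True, True]

# Stage 2: indexing with the mask keeps the True rows and their original labels
subset = sample_df[mask]
print(subset.index.tolist())        # [2, 3]
print(subset['playerID'].tolist())  # ['bagweje01', 'batisto01']

# DataFrame.query expresses the same condition as a string
same_subset = sample_df.query('HR > 40')
```

The subset keeps the index labels of the source frame, which is why the full output starts at 55 instead of 0. Calling `reset_index(drop=True)` on the subset would renumber the rows if that is preferred.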


8. Save this new subset as a JSON file.

In [14]:
mlb_40_hr_df.to_json("21st_century_MLB_Batting_Subset.json")
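The layout of the JSON file depends on the `orient` parameter of `to_json`. For a DataFrame the default is `'columns'`, which nests values under column name and then index label, while `orient='records'` writes one object per row and drops the index. The sketch below compares the two on a two-row sample (two columns from the first rows of the subset) and keeps the JSON in memory instead of writing a file; `sample_df`, `default_json`, and `records_json` are illustrative names.

```python
import json
import pandas as pd

# Two rows of the subset (playerID and HR only), with their original index labels
sample_df = pd.DataFrame(
    {'playerID': ['bagweje01', 'batisto01'], 'HR': [47, 41]},
    index=[55, 72],
)

# With no path argument, to_json() returns the JSON text as a string
default_json = sample_df.to_json()                   # orient='columns' by default
records_json = sample_df.to_json(orient='records')   # one object per row

print(json.loads(default_json))
# {'playerID': {'55': 'bagweje01', '72': 'batisto01'}, 'HR': {'55': 47, '72': 41}}

print(json.loads(records_json))
# [{'playerID': 'bagweje01', 'HR': 47}, {'playerID': 'batisto01', 'HR': 41}]
```

Which layout is preferable depends on what will read the file; the records form is often easier for non-pandas consumers, at the cost of losing the index labels.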