<h1><center>Pandas DataFrame</center></h1>

## Prep: Import the Pandas package

If the import below fails, install the pandas package from your command line:

`conda install pandas`

In [2]:
import pandas as pd

## Section 1: Create a DataFrame

A pandas DataFrame is a two-dimensional, labeled data structure whose columns can hold different types. It is the most commonly used pandas object.

A DataFrame can be created in several ways. The methods below cover the most common ones, one at a time.

#### Method #1: Creating a DataFrame from a nested list.

In [3]:
# Create a list of lists
my_data = [['Tom', 3.8],['Joe', 3.2],['Li', 2.7],['Kurt', 4.0]]

 
# Create the pandas DataFrame by providing 
# the data and column indices
df = pd.DataFrame(data = my_data, columns = ['Name', 'GPA'])

 
# Show the dataframe from REPL.
df

Unnamed: 0,Name,GPA
0,Tom,3.8
1,Joe,3.2
2,Li,2.7
3,Kurt,4.0
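
To make the "columns of potentially different types" point concrete, the minimal sketch below rebuilds the same nested-list DataFrame and inspects each column's type. The variable names `df_students` and `gpa` are chosen here for illustration only.

```python
import pandas as pd

# Same nested-list data as in Method #1.
my_data = [['Tom', 3.8], ['Joe', 3.2], ['Li', 2.7], ['Kurt', 4.0]]
df_students = pd.DataFrame(data=my_data, columns=['Name', 'GPA'])

# Each column carries its own dtype: a text dtype for Name
# (reported as "object" in many pandas versions) and float64 for GPA.
print(df_students.dtypes)

# Selecting a single column returns a pandas Series.
gpa = df_students['GPA']
print(type(gpa).__name__)
print(gpa.mean())
```

The inner lists are rows, so every inner list should have one value per entry in `columns`.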


#### Method #2: Creating a DataFrame from a dictionary of lists

To create a DataFrame from a dictionary of lists, all the lists must have the same length. If an index is passed, its length must equal the length of each list. If no index is passed, the index defaults to `range(n)`, where `n` is the length of each list.

In [5]:
# Create a dictionary of lists.
data = {'Name' : ['Tom', 'Joe', 'Li', 'Kurt'],
       'GPA' : [3.8, 3.2, 2.7, 4.0]}

 
# Create a DataFrame with the dictionary 
df = pd.DataFrame(data)

 
# Show the DataFrame from REPL
df

Unnamed: 0,Name,GPA
0,Tom,3.8
1,Joe,3.2
2,Li,2.7
3,Kurt,4.0


By default, the row indices are the integers in `range(n)`. You can also supply custom row labels through the `index` argument.

In [6]:
# Create a dictionary of lists.
data = {'Name' : ['Tom', 'Joe', 'Li', 'Kurt'],
       'GPA' : [3.8, 3.2, 2.7, 4.0]}

 
# Create a DataFrame with the dictionary 
df = pd.DataFrame(data, index = ['Stu1', 'Stu2', 'Stu3', 'Stu4'])

 
# Show the DataFrame from REPL
df

Unnamed: 0,Name,GPA
Stu1,Tom,3.8
Stu2,Joe,3.2
Stu3,Li,2.7
Stu4,Kurt,4.0
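
The length rules for Method #2 can be checked directly. The sketch below is only an illustration with made-up variable names (`bad_data`, `good_data`): it tries one dictionary whose lists differ in length and one index that is too short, and catches the `ValueError` that pandas raises in each case.

```python
import pandas as pd

# Lists of unequal length: 'GPA' is missing one value.
bad_data = {'Name': ['Tom', 'Joe', 'Li', 'Kurt'],
            'GPA': [3.8, 3.2, 2.7]}

try:
    pd.DataFrame(bad_data)
    unequal_ok = True
except ValueError as err:
    unequal_ok = False
    print('Unequal lists rejected:', err)

# Lists of equal length, but an index with only three labels.
good_data = {'Name': ['Tom', 'Joe', 'Li', 'Kurt'],
             'GPA': [3.8, 3.2, 2.7, 4.0]}

try:
    pd.DataFrame(good_data, index=['Stu1', 'Stu2', 'Stu3'])
    short_index_ok = True
except ValueError as err:
    short_index_ok = False
    print('Short index rejected:', err)
```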


#### Method #3: Create a DataFrame by reading it from a data file in `csv` or `excel` format.

Read the dataset provided on Canvas into a pandas DataFrame named `df`.

In [7]:
df = pd.read_csv('nfl_height_weight.csv')



In [8]:
# Show the dataframe from REPL
df

Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL
...,...,...,...,...,...,...,...
1869,67,"White, Cody",G,75,303,7/1/1988,HOU
1870,73,"Williams, Brennan",OT,78,314,2/5/1991,HOU
1871,54,"Williams, Trevardo",OLB,73,237,12/31/1990,HOU
1872,41,"Wood, Cierre",RB,71,215,2/21/1991,HOU


The `read_csv()` function can also read CSV files hosted on the Internet; you only need to supply the file's URL.

For example, the same file is hosted on GitHub: https://raw.githubusercontent.com/BlueJayADAL/CS121/main/datasets/nfl_height_weight.csv

In [9]:
df = pd.read_csv('https://raw.githubusercontent.com/BlueJayADAL/CS121/main/datasets/nfl_height_weight.csv')



In [10]:
# Show the dataframe from REPL
df

Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL
...,...,...,...,...,...,...,...
1869,67,"White, Cody",G,75,303,7/1/1988,HOU
1870,73,"Williams, Brennan",OT,78,314,2/5/1991,HOU
1871,54,"Williams, Trevardo",OLB,73,237,12/31/1990,HOU
1872,41,"Wood, Cierre",RB,71,215,2/21/1991,HOU
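
If neither the local file nor the URL is available, `read_csv()` can be tried on any file-like object. The sketch below is an illustration, not part of the course files: it feeds the first three rows shown above through `io.StringIO`, and uses the `usecols` and `nrows` parameters to load only part of the data. For Excel workbooks, `pd.read_excel()` plays the same role, though it typically needs an extra engine package such as `openpyxl` installed.

```python
import io
import pandas as pd

# Three rows copied from the output above, in the same column layout.
csv_text = '''number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
23,"Alford, Robert",CB,70,186,11/1/1988,ATL
95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
72,"Baker, Sam",T,77,301,5/30/1985,ATL
'''

# read_csv accepts a path, a URL, or a file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)

# Load only two columns and the first two records.
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['full_name', 'team'],
                     nrows=2)
print(subset)
```

Note that the quoted names contain commas; the quotes keep each name in a single column.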


## Section 2: Exploratory Data Analysis - take a sneak peek at the DataFrame

#### Inspect the contents of the DataFrame `df` using the `head()` method.

In [11]:
# head() by default displays the first 5 records of data from the DataFrame
df.head()



Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL


#### Try to read the first 10 and 20 records as well

In [12]:
df.head(10)


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL
5,63,"Blalock, Justin",G,76,326,12/20/1983,ATL
6,5,"Bosher, Matt",P,72,208,10/18/1987,ATL
7,3,"Bryant, Matt",K,69,203,5/29/1975,ATL
8,51,"Chaney, Jamar",LB,72,242,10/11/1986,ATL
9,86,"Coffman, Chase",TE,78,250,11/10/1986,ATL


In [13]:
df.head(20)


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL
5,63,"Blalock, Justin",G,76,326,12/20/1983,ATL
6,5,"Bosher, Matt",P,72,208,10/18/1987,ATL
7,3,"Bryant, Matt",K,69,203,5/29/1975,ATL
8,51,"Chaney, Jamar",LB,72,242,10/11/1986,ATL
9,86,"Coffman, Chase",TE,78,250,11/10/1986,ATL


#### Exercise: Can you guess how to view the last few records?

In [14]:
df.tail()



Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
1869,67,"White, Cody",G,75,303,7/1/1988,HOU
1870,73,"Williams, Brennan",OT,78,314,2/5/1991,HOU
1871,54,"Williams, Trevardo",OLB,73,237,12/31/1990,HOU
1872,41,"Wood, Cierre",RB,71,215,2/21/1991,HOU
1873,13,"Yates, T.J.",QB,76,217,5/28/1987,HOU
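
Like `head()`, `tail()` takes an optional record count. The sketch below uses a small made-up DataFrame named `demo` so that the result is easy to verify by eye. It also shows that a negative count passed to `head()` means "everything except the last n records".

```python
import pandas as pd

# A toy DataFrame with the values 0..9 (illustration only).
demo = pd.DataFrame({'value': range(10)})

# The last three records.
last_three = demo.tail(3)
print(last_three)

# All records except the last two.
all_but_last_two = demo.head(-2)
print(all_but_last_two)
```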


## Section 3: Attributes of a DataFrame

#### Show the data types of all columns in the DataFrame.

In [15]:
df.dtypes



number               int64
full_name           object
position            object
height_in_inches     int64
weight_in_lbs        int64
date_of_birth       object
team                object
dtype: object
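
In the output above, `date_of_birth` is reported as `object`, which means the dates were read as plain text. One possible way to turn them into real dates is `pd.to_datetime()`. The sketch below assumes the month/day/year layout seen in the data and uses a small made-up DataFrame named `players`.

```python
import pandas as pd

# Three birth dates copied from the dataset, stored as text.
players = pd.DataFrame({
    'full_name': ['Alford, Robert', 'Babineaux, Jonathan', 'Baker, Sam'],
    'date_of_birth': ['11/1/1988', '10/12/1981', '5/30/1985'],
})

# Parse the text as month/day/year (assumed format).
players['date_of_birth'] = pd.to_datetime(players['date_of_birth'],
                                          format='%m/%d/%Y')
print(players.dtypes)

# Datetime columns expose date parts through the .dt accessor.
birth_years = players['date_of_birth'].dt.year.tolist()
print(birth_years)
```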

#### List all the column names:

In [17]:
for col_name in df.columns:
    print(col_name)


number
full_name
position
height_in_inches
weight_in_lbs
date_of_birth
team


#### Show the shape (dimensionality) of the DataFrame

In [18]:
# shape is a tuple: the first element is the number of records (rows),
# and the second element is the number of columns

df.shape


(1874, 7)
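
Because `shape` is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a small made-up DataFrame named `small` and also shows two related shortcuts: `len()` for the number of records and `columns.tolist()` for the column names as a plain list.

```python
import pandas as pd

small = pd.DataFrame({'Name': ['Tom', 'Joe', 'Li', 'Kurt'],
                      'GPA': [3.8, 3.2, 2.7, 4.0]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = small.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of records.
print(len(small))

# Column names as a regular Python list.
col_names = small.columns.tolist()
print(col_names)
```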

## Section 4: Some basic methods of a DataFrame object

#### Show basic information about the DataFrame, including column names, dtypes, and memory usage.


In [19]:
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1874 entries, 0 to 1873
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   number            1874 non-null   int64 
 1   full_name         1874 non-null   object
 2   position          1874 non-null   object
 3   height_in_inches  1874 non-null   int64 
 4   weight_in_lbs     1874 non-null   int64 
 5   date_of_birth     1874 non-null   object
 6   team              1874 non-null   object
dtypes: int64(3), object(4)
memory usage: 102.6+ KB
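
The `+` in `102.6+ KB` marks the figure as a lower bound: the contents of text columns are not counted unless a deep measurement is requested. The sketch below, on a small made-up DataFrame named `small`, shows one way to ask for that deeper measurement. Exact byte counts for text columns vary by pandas version and platform, so none are shown here.

```python
import io
import pandas as pd

small = pd.DataFrame({'Name': ['Tom', 'Joe', 'Li', 'Kurt'],
                      'GPA': [3.8, 3.2, 2.7, 4.0]})

# Per-column memory in bytes; deep=True also measures the string contents.
mem = small.memory_usage(deep=True)
print(mem)

# info() prints to stdout by default; buf= captures the report as text.
buf = io.StringIO()
small.info(buf=buf, memory_usage='deep')
report = buf.getvalue()
print(report)
```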


#### Show the descriptive statistics of the DataFrame

In [21]:
df.describe()



Unnamed: 0,number,height_in_inches,weight_in_lbs
count,1874.0,1874.0,1874.0
mean,52.694237,73.965848,247.282818
std,28.819558,2.642199,45.728803
min,1.0,65.0,165.0
25%,27.0,72.0,208.0
50%,54.0,74.0,240.0
75%,79.0,76.0,293.0
max,99.0,81.0,359.0
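
By default, `describe()` summarizes only the numeric columns, which is why the text columns are missing from the output above. Passing `include='all'` adds them. The sketch below uses a made-up five-row DataFrame named `roster`, built from values in the first five records.

```python
import pandas as pd

roster = pd.DataFrame({
    'position': ['CB', 'DT', 'T', 'OLB', 'OLB'],
    'team': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL'],
    'weight_in_lbs': [186, 300, 301, 230, 255],
})

# include='all' summarizes every column. Text columns get
# count/unique/top/freq; numeric columns get the usual statistics.
summary = roster.describe(include='all')
print(summary)
```

Cells that do not apply to a column, such as `mean` for `position`, are filled with `NaN`.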


#### Calculate average, min, max, and standard deviation for all numeric columns


In [22]:
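# numeric_only=True skips the text columns; without it, recent
# pandas versions raise a TypeError on this DataFrame
df.mean(numeric_only=True)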



number               52.694237
height_in_inches     73.965848
weight_in_lbs       247.282818
dtype: float64

In [23]:
# min() also covers the text columns, which are compared as strings
df.min()



number                              1
full_name           Abdul-Quddus, Isa
position                            C
height_in_inches                   65
weight_in_lbs                     165
date_of_birth                1/1/1985
team                              ARI
dtype: object

In [24]:
# max() also covers the text columns, which are compared as strings
df.max()



number                          99
full_name           Zuttah, Jeremy
position                        WR
height_in_inches                81
weight_in_lbs                  359
date_of_birth             9/9/1989
team                           WAS
dtype: object

In [25]:
# numeric_only=True skips the text columns, as with mean()
df.std(numeric_only=True)



number              28.819558
height_in_inches     2.642199
weight_in_lbs       45.728803
dtype: float64
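
The four statistics can also be computed in one call. The sketch below is one possible approach, not the only one: it first keeps the numeric columns with `select_dtypes()` and then passes a list of function names to `agg()`. The DataFrame `roster` is a made-up sample built from the first five records.

```python
import pandas as pd

roster = pd.DataFrame({
    'full_name': ['Alford, Robert', 'Babineaux, Jonathan', 'Baker, Sam',
                  'Bartu, Joplo', 'Biermann, Kroy'],
    'height_in_inches': [70, 74, 77, 74, 75],
    'weight_in_lbs': [186, 300, 301, 230, 255],
})

# Keep only the numeric columns.
numeric = roster.select_dtypes(include='number')

# One row per statistic, one column per numeric column.
stats = numeric.agg(['mean', 'min', 'max', 'std'])
print(stats)
```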

#### Exercise: What are the mean values of the first 50 records in the dataset? 

In [26]:
df.head(50).mean(numeric_only=True)



number               52.42
height_in_inches     74.04
weight_in_lbs       246.02
dtype: float64
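
An alternative to `head(50)` is positional slicing with `iloc`, which selects the same records. The sketch below uses a made-up DataFrame named `scores` holding the integers 1 through 100, so the expected mean of the first 50 records (25.5) can be checked by hand.

```python
import pandas as pd

# Toy data: the integers 1..100 (illustration only).
scores = pd.DataFrame({'points': range(1, 101)})

# Two equivalent ways to take the first 50 records.
mean_head = scores.head(50).mean(numeric_only=True)
mean_iloc = scores.iloc[:50].mean(numeric_only=True)

print(mean_head)
print(mean_iloc)
```

`iloc` is the more general tool: a slice such as `scores.iloc[50:100]` selects a block from the middle of the data, which `head()` alone cannot do.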