#### Learning by Examples
You can run each example yourself and modify the code to see how the result changes.

In [2]:
#Load a CSV file into a Pandas DataFrame:
import pandas as pd
df = pd.read_csv("Universities.csv")
print(df.to_string())

            Univ   SAT  Top10  Accept  SFRatio  Expenses  GradRate
0          Brown  1310     89      22       13     22704        94
1        CalTech  1415    100      25        6     63575        81
2            CMU  1260     62      59        9     25026        72
3       Columbia  1310     76      24       12     31510        88
4        Cornell  1280     83      33       13     21864        90
5      Dartmouth  1340     89      23       10     32162        95
6           Duke  1315     90      30       12     31585        95
7     Georgetown  1255     74      24       12     20126        92
8        Harvard  1400     91      14       11     39525        97
9   JohnsHopkins  1305     75      44        7     58691        87
10           MIT  1380     94      30       10     34870        91
11  Northwestern  1260     85      39       11     28052        89
12     NotreDame  1255     81      42       13     15122        94
13     PennState  1081     38      54       18     10185      

### Pandas Introduction

#### What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

#### Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Note: Data science is a branch of computer science where we study how to store, use, and analyze data in order to derive information from it.

#### What Can Pandas Do?
Pandas gives you answers about the data, such as:

Is there a correlation between two or more columns?
What is the average value?
What is the max value?
What is the min value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
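
The questions above can be sketched in a few lines. This is only an illustration: the column names and values below are made up, not taken from a real data set, and `None` stands in for an empty value.

```python
import pandas as pd

# Hypothetical workout data; None marks a missing value.
df = pd.DataFrame({
    "duration": [60, 60, 45, 45, 30],
    "calories": [400.0, 410.0, 300.0, None, 200.0],
})

# Is there a correlation between the two columns?
correlation = df["duration"].corr(df["calories"])
print(correlation)

# Average, max, and min of one column (missing values are skipped).
print(df["calories"].mean())
print(df["calories"].max())
print(df["calories"].min())

# Cleaning: drop rows that contain empty (NaN) values.
cleaned = df.dropna()
print(cleaned)
```

Note that `dropna()` returns a new DataFrame and leaves `df` unchanged.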

#### Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:

In [3]:
import pandas

Now Pandas is imported and ready to use.

In [6]:
import pandas
my_cars = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
my_var = pandas.DataFrame(my_cars)
print(my_var)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


#### Pandas as pd
Pandas is usually imported under the pd alias.

In [7]:
import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.

In [8]:
import pandas as pd
my_cars = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
my_var = pd.DataFrame(my_cars)
print(my_var)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas Series

#### What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [9]:
#Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

0    1
1    7
2    2
dtype: int64


#### Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [10]:
#Return the first value of the Series:
print(myvar[0])

1


#### Create Labels
With the index argument, you can name your own labels.

In [11]:
#Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ['x', 'y', 'z'])
print(myvar)

x    1
y    7
z    2
dtype: int64


When you have created labels, you can access an item by referring to the label.

In [12]:
#Return the value of "y":
print(myvar['y'])

7
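
Once the labels are custom strings, looking a value up by label and looking it up by position are two different things. A minimal sketch of the distinction, using the `.loc` and `.iloc` accessors of a Series:

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# .loc looks a value up by its label.
by_label = myvar.loc["y"]

# .iloc looks a value up by its integer position (0-based).
by_position = myvar.iloc[1]

print(by_label, by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a number means a label or a position.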


#### Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

In [19]:
#Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {'day1': 420, 'day2': 380, 'day3': 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [15]:
#Create a Series using only data from "day1" and "day2":
import pandas as pd
calories = {'day1': 420, "day2": 380, "day3": 390}
my_var = pd.Series(calories, index = ['day1', 'day2'])
print(my_var)

day1    420
day2    380
dtype: int64
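
One detail worth knowing: if the index argument names a label that is not a key in the dictionary, Pandas does not raise an error. The sketch below uses a hypothetical label "day4" to show what happens instead.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
myvar = pd.Series(calories, index=["day1", "day4"])
print(myvar)

# The missing entry is stored as NaN, which makes the dtype float64.
print(myvar.isna())
```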


#### DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [17]:
#Create a DataFrame from a dictionary of two lists (each list becomes a column):
import pandas as pd
data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}
my_var = pd.DataFrame(data)
print(my_var)

   calories  duration
0       420        50
1       380        40
2       390        45
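
The example above passes plain lists in the dictionary. As a sketch of the same idea with actual Series objects, the values can be wrapped in `pd.Series` first; each Series then becomes one column.

```python
import pandas as pd

calories = pd.Series([420, 380, 390], name="calories")
duration = pd.Series([50, 40, 45], name="duration")

# Each Series becomes one column; rows are aligned on the shared index.
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)
```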


### Pandas DataFrames

#### What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [21]:
#Create a simple Pandas DataFrame:
import pandas as pd
data = {
    'calories': [420, 380, 390],
    'duration': [50, 45, 40]
}
df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        45
2       390        40


#### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [22]:
#Return row 0:
##refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


Note: This example returns a Pandas Series.

In [25]:
#Return row 0 and 1:
##use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        45


Note: When a list of labels is passed inside the brackets, as in df.loc[[0, 1]], the result is a Pandas DataFrame.
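
The difference between the two return types can be checked directly. A small sketch, rebuilding the same DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 45, 40]})

one_row = df.loc[0]       # single label: the row comes back as a Series
row_list = df.loc[[0]]    # list with one label: the row comes back as a DataFrame

print(type(one_row).__name__)
print(type(row_list).__name__)
```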

#### Named Indexes
With the index argument, you can name your own indexes.

In [26]:
#Add a list of names to give each row a name:
import pandas as pd
data = {
    'calories': [420, 380, 390],
    'duration': [50, 45, 42]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

      calories  duration
day1       420        50
day2       380        45
day3       390        42


#### Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

In [27]:
#Return "day2":
##refer to the named index:
print(df.loc['day2'])

calories    380
duration     45
Name: day2, dtype: int64
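
Named indexes also work with lists and slices in loc. The following sketch rebuilds the DataFrame from the example above and shows three common lookups.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 45, 42]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns those rows as a DataFrame.
picked = df.loc[["day1", "day3"]]

# A label slice includes BOTH endpoints, unlike normal Python slicing.
sliced = df.loc["day1":"day2"]

# A row label and a column label together return a single value.
value = df.loc["day2", "duration"]

print(picked, sliced, value, sep="\n\n")
```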


#### Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [28]:
#Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv("Universities.csv")
print(df)

            Univ   SAT  Top10  Accept  SFRatio  Expenses  GradRate
0          Brown  1310     89      22       13     22704        94
1        CalTech  1415    100      25        6     63575        81
2            CMU  1260     62      59        9     25026        72
3       Columbia  1310     76      24       12     31510        88
4        Cornell  1280     83      33       13     21864        90
5      Dartmouth  1340     89      23       10     32162        95
6           Duke  1315     90      30       12     31585        95
7     Georgetown  1255     74      24       12     20126        92
8        Harvard  1400     91      14       11     39525        97
9   JohnsHopkins  1305     75      44        7     58691        87
10           MIT  1380     94      30       10     34870        91
11  Northwestern  1260     85      39       11     28052        89
12     NotreDame  1255     81      42       13     15122        94
13     PennState  1081     38      54       18     10185      
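
If the CSV file is not at hand, the same call can be tried on in-memory text. This sketch assumes the file is a plain comma-separated table with a header row; the three rows are copied from the output above, and `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk: three rows copied from the output above.
csv_text = """Univ,SAT,Top10,Accept,SFRatio,Expenses,GradRate
Brown,1310,89,22,13,22704,94
CalTech,1415,100,25,6,63575,81
CMU,1260,62,59,9,25026,72
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# index_col makes a column the row labels, so loc can look rows up by name.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Univ")
print(by_name.loc["CMU", "SAT"])
```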

### Pandas Read CSV

#### Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'Titanic.csv'.

In [31]:
#Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv("Titanic.csv")
print(df.to_string())

     Class  Gender    Age Survived
0      3rd    Male  Child       No
1      3rd    Male  Child       No
2      3rd    Male  Child       No
3      3rd    Male  Child       No
4      3rd    Male  Child       No
5      3rd    Male  Child       No
6      3rd    Male  Child       No
7      3rd    Male  Child       No
8      3rd    Male  Child       No
9      3rd    Male  Child       No
10     3rd    Male  Child       No
11     3rd    Male  Child       No
12     3rd    Male  Child       No
13     3rd    Male  Child       No
14     3rd    Male  Child       No
15     3rd    Male  Child       No
16     3rd    Male  Child       No
17     3rd    Male  Child       No
18     3rd    Male  Child       No
19     3rd    Male  Child       No
20     3rd    Male  Child       No
21     3rd    Male  Child       No
22     3rd    Male  Child       No
23     3rd    Male  Child       No
24     3rd    Male  Child       No
25     3rd    Male  Child       No
26     3rd    Male  Child       No
27     3rd    Male  

Tip: use to_string() to print the entire DataFrame.

If you have a large DataFrame with many rows, Pandas will only return the first 5 rows, and the last 5 rows:

In [32]:
#Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv("Titanic.csv")
df

     Class  Gender    Age Survived
0      3rd    Male  Child       No
1      3rd    Male  Child       No
2      3rd    Male  Child       No
3      3rd    Male  Child       No
4      3rd    Male  Child       No
...    ...     ...    ...      ...
2196  Crew  Female  Adult      Yes
2197  Crew  Female  Adult      Yes
2198  Crew  Female  Adult      Yes
2199  Crew  Female  Adult      Yes


#### max_rows
The number of rows returned is defined in Pandas option settings.

You can check your system's maximum rows with the pd.options.display.max_rows statement.

In [34]:
#Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)

60


On my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement will show only the headers and the first and last 5 rows.

You can change the maximum rows number with the same statement.

In [36]:
#Increase the maximum number of rows to display the entire DataFrame:
pd.options.display.max_rows = 9999
df = pd.read_csv('Universities.csv')
print(df)

            Univ   SAT  Top10  Accept  SFRatio  Expenses  GradRate
0          Brown  1310     89      22       13     22704        94
1        CalTech  1415    100      25        6     63575        81
2            CMU  1260     62      59        9     25026        72
3       Columbia  1310     76      24       12     31510        88
4        Cornell  1280     83      33       13     21864        90
5      Dartmouth  1340     89      23       10     32162        95
6           Duke  1315     90      30       12     31585        95
7     Georgetown  1255     74      24       12     20126        92
8        Harvard  1400     91      14       11     39525        97
9   JohnsHopkins  1305     75      44        7     58691        87
10           MIT  1380     94      30       10     34870        91
11  Northwestern  1260     85      39       11     28052        89
12     NotreDame  1255     81      42       13     15122        94
13     PennState  1081     38      54       18     10185      

### Pandas Read JSON

#### Read JSON
Big data sets are often stored or extracted as JSON.

JSON is plain text with an object-like structure, and it is widely supported across programming tools, including Pandas.

In our examples we will be using a JSON file called 'data.json'.

In [37]:
#Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json("data.json")
print(df.to_string())

#### Dictionary as JSON

JSON = Python Dictionary

JSON objects have nearly the same format as Python dictionaries.

If your JSON code is not in a file, but in a Python Dictionary, you can load it into a DataFrame directly:

In [38]:
#Load a Python Dictionary into a DataFrame:
import pandas as pd
data = {
    'Duration': {
        "0":60,
        "1":60,
        "2":60,
        "3":45,
        "4":45,
        "5":60
    },
    "Pulse": {
        "0":110,
        "1":117,
        "2":103,
        "3":109,
        "4":117,
        "5":102
    },
    "MaxPulse": {
        "0":130,
        "1":145,
        "2":135,
        "3":175,
        "4":148,
        "5":127
    },
    "Calories": {
        "0":409,
        "1":479,
        "2":340,
        "3":282,
        "4":406,
        "5":300
    }
}
df = pd.DataFrame(data)
print(df)

   Duration  Pulse  MaxPulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
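
When the JSON arrives as text rather than as a file or a ready-made dictionary, there are two routes to a DataFrame. This sketch uses a shortened copy of the data above; the variable names are illustrative. JSON text is close to Python dictionary syntax but not identical (for example, JSON writes true, false, and null), so it needs to be parsed first.

```python
import io
import json
import pandas as pd

# A small JSON document as text (values copied from the example above).
json_text = '{"Duration": {"0": 60, "1": 60, "2": 60}, "Pulse": {"0": 110, "1": 117, "2": 103}}'

# Option 1: parse the text into a Python dictionary, then build the DataFrame.
data = json.loads(json_text)
df_from_dict = pd.DataFrame(data)

# Option 2: let Pandas parse it; StringIO makes the text look like a file.
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_dict)
print(df_from_json)
```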


### Pandas - Analyzing DataFrames

#### Viewing the Data
One of the most used methods for getting a quick overview of a DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

In [39]:
#Get a quick overview by printing the first 10 rows of the DataFrame:
df = pd.read_csv("Titanic.csv")
print(df.head(10))

  Class Gender    Age Survived
0   3rd   Male  Child       No
1   3rd   Male  Child       No
2   3rd   Male  Child       No
3   3rd   Male  Child       No
4   3rd   Male  Child       No
5   3rd   Male  Child       No
6   3rd   Male  Child       No
7   3rd   Male  Child       No
8   3rd   Male  Child       No
9   3rd   Male  Child       No


Note: if the number of rows is not specified, the head() method will return the top 5 rows.

In [40]:
#Print the first 5 rows of the DataFrame:
df = pd.read_csv("Titanic.csv")
print(df.head())

  Class Gender    Age Survived
0   3rd   Male  Child       No
1   3rd   Male  Child       No
2   3rd   Male  Child       No
3   3rd   Male  Child       No
4   3rd   Male  Child       No


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [41]:
#Print the last 5 rows of the DataFrame:
print(df.tail())

     Class  Gender    Age Survived
2196  Crew  Female  Adult      Yes
2197  Crew  Female  Adult      Yes
2198  Crew  Female  Adult      Yes
2199  Crew  Female  Adult      Yes
2200  Crew  Female  Adult      Yes
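
Both methods accept a row count. A small sketch on a made-up 20-row DataFrame, so it runs without the CSV file:

```python
import pandas as pd

# Hypothetical 20-row DataFrame standing in for a larger file.
df = pd.DataFrame({"n": range(20), "square": [i * i for i in range(20)]})

first_three = df.head(3)   # rows 0, 1, 2
last_three = df.tail(3)    # rows 17, 18, 19

# head() and tail() return new DataFrames, so they can be combined.
both_ends = pd.concat([first_three, last_three])
print(both_ends)
```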


#### Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

In [42]:
#Print information about the data:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Class     2201 non-null   object
 1   Gender    2201 non-null   object
 2   Age       2201 non-null   object
 3   Survived  2201 non-null   object
dtypes: object(4)
memory usage: 68.9+ KB
None
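
The info() method prints its report itself and returns None, so wrapping it in print() adds a trailing None line. The sketch below, on a small made-up DataFrame, calls info() directly and also shows one way to capture the report as text through the `buf` parameter.

```python
import io
import pandas as pd

# Small hypothetical frame; None becomes a missing value.
df = pd.DataFrame({
    "Class": ["3rd", "Crew", "1st"],
    "Survived": ["No", "Yes", None],
})

# info() writes its report itself and returns None, so print() is not needed.
result = df.info()

# To keep the report as text, pass a buffer for info() to write into.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

# Non-null counts per column are also available directly.
counts = df.count()
print(counts)
```

A non-null count lower than the number of entries is a quick sign that a column has empty values to clean.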
