# 1. From Beginner to Advanced pandas


**1.1. Intro to Dataframes**

The most fundamental object in pandas is the DataFrame. This is how your data is stored: ``a tabular format with rows and columns, as you'd find in a spreadsheet or a database table``. So before we dive into more advanced pandas topics, let's review the DataFrame concept. 

After importing pandas as pd, we're going to create a dictionary called scores. 

In [1]:
# library

import pandas as pd

Now, a dictionary is a Python structure that stores key-value pairs. In this dictionary, the keys are name, city, and score, and the values are lists, as denoted by the square brackets, each mapped to its corresponding key. After running this cell, we're going to turn this dictionary into a pandas DataFrame using the pd.DataFrame constructor. 

In [2]:
# create a dataset

scores = {"name": ['Ray', 'Japhy', 'Zosa'],
          "city": ['San Francisco', 'San Francisco', 'Denver'],
          "score":[75,92,94]
          }

In [3]:
# data frames

df=pd.DataFrame(scores)

Note the capitalization of the F in DataFrame. Great. Now let's see our data. Here you can see a table with name, city, and score as column headers, and three rows of corresponding data. Each column is a Series. And notice the values zero, one, and two to the left. These make up the index of our DataFrame and are useful for referencing and subsetting our data. 

In [4]:
df

Unnamed: 0,name,city,score
0,Ray,San Francisco,75
1,Japhy,San Francisco,92
2,Zosa,Denver,94
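
To make the pieces described above concrete, here is a minimal sketch that rebuilds the same DataFrame and inspects its columns, its index, and the type of a single column. It only uses the scores dictionary from this lesson.

```python
import pandas as pd

# Same dictionary as in the lesson: keys become column names,
# and each list becomes one column of values.
scores = {
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, 92, 94],
}
df = pd.DataFrame(scores)

print(list(df.columns))   # column labels come from the dictionary keys
print(list(df.index))     # default index labels: 0, 1, 2
print(type(df["score"]))  # a single column is a pandas Series
print(df.shape)           # (number of rows, number of columns)
```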


If we want to return just one column of our DataFrame, the notation is the DataFrame followed by the column name in square brackets; to return several columns, pass a list of names instead. Here let's take a look at score. Note that in this example you can also call df.score to return the same result. 

In [6]:
# retrieving a single column

df['score']

0    75
1    92
2    94
Name: score, dtype: int64
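
As a small sketch of the selection styles mentioned above, the snippet below compares bracket access, attribute access, and selecting several columns with a list. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, 92, 94],
})

by_brackets = df["score"]       # one column name -> a Series
# Attribute access only works when the column name is a valid Python
# identifier and does not collide with a DataFrame attribute or method.
by_attribute = df.score
subset = df[["name", "score"]]  # a list of names -> a smaller DataFrame

print(by_brackets.equals(by_attribute))
print(list(subset.columns))
```

Bracket access is usually the safer habit, since it keeps working for column names that contain spaces or that match a method name.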

Similarly, you can create new columns in your DataFrame by passing a new column name into the square brackets and assigning to it. Here, we're creating a new column that combines the name and city columns. 

In [7]:
# creating a new column

df['name_city'] = df['name'] + '_' + df['city']
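
The cell above does not display its result, so here is a short sketch that shows what the new column contains. It also adds a second derived column, above_mean, which is a hypothetical name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, 92, 94],
})

# String columns are combined element by element, row by row.
df["name_city"] = df["name"] + "_" + df["city"]

# Hypothetical extra column: compare each score with the column mean.
df["above_mean"] = df["score"] > df["score"].mean()

print(df[["name_city", "above_mean"]])
```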

Now let's say we want to subset our data to show only those folks with scores above, say, 90. To do that, we create a Boolean expression that returns True for scores greater than 90 and keep only the records where this condition holds. After running the cell, we get back a DataFrame with just Japhy and Zosa's records. 

In [8]:
# filtering your data
df[df['score']>90]

Unnamed: 0,name,city,score,name_city
1,Japhy,San Francisco,92,Japhy_San Francisco
2,Zosa,Denver,94,Zosa_Denver
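
To show what the filter is doing step by step, this sketch first builds the Boolean mask on its own and then uses it to select rows. The second filter, which combines two conditions, is an extra illustration that is not part of the original lesson.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, 92, 94],
})

# Step 1: the comparison yields one True/False value per row.
mask = df["score"] > 90
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True.
high_scores = df[mask]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
denver_high = df[(df["score"] > 90) & (df["city"] == "Denver")]
print(denver_high["name"].tolist())
```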


Also note our new column, name_city. All this only scratches the surface of what you can do when your data is in a DataFrame, but this is an excellent start for us to build on for future lessons.

**1.2. Top functions using pandas**

As you use pandas, you'll find there are certain functions that prove their worth time and time again. In this lesson, we'll cover some of the most important functions that you can use to get more from your data. pandas is very flexible in that you can import data from a wide variety of sources, including CSV files, Excel files, JSON, databases, Parquet files, you name it. 

For this lesson, we'll use the pandas read_csv function to import the iris dataset as a data frame. This is a common sample dataset for practicing data science. First we import pandas as pd. Next we'll read the iris CSV into a data frame called iris. 

In [None]:
# library

import pandas as pd

In [13]:
# import CSV file

iris = pd.read_csv('iris.csv')
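
If you don't have iris.csv on disk, the following sketch shows the same read_csv call working on in-memory text. The three data rows are the ones displayed later in this lesson; the io.StringIO wrapper and the variable names are assumptions made so the example is self-contained.

```python
import io

import pandas as pd

# A tiny stand-in for iris.csv, held in memory instead of on disk.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)

# usecols limits the import to the columns you actually need.
narrow = pd.read_csv(io.StringIO(csv_text), usecols=["sepal_length", "species"])
print(list(narrow.columns))
```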

Data frames have an attribute called shape, which tells us the dimensions of our data as a (rows, columns) tuple. By calling iris.shape, we can see the number of rows and columns that our data frame has. We have 150 rows and five columns in our data. 

In [14]:
# explore your data

iris.shape

(150, 5)

To preview our data, the head function will return the top rows of our data frame. Let's check out the top three rows of the iris data. You can see this data contains measurement data for different iris species. 

In [16]:
# top 3 rows

iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


Similarly, you can also view the bottom rows with the tail function. 

In [17]:
# bottom 3 rows

iris.tail(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica
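
Here is a brief sketch of head and tail on a small made-up frame, including what happens when you leave out the row count. The column name n is purely illustrative.

```python
import pandas as pd

# Ten rows labelled 0 through 9.
numbers = pd.DataFrame({"n": range(10)})

first_default = numbers.head()  # with no argument, head returns 5 rows
first_three = numbers.head(3)   # the first 3 rows
last_three = numbers.tail(3)    # the last 3 rows, original labels kept

print(len(first_default))
print(list(last_three.index))
```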


``Datatypes``

Now, when working with data in pandas, you'll find that the data types pandas assigns to your data are important and will influence what operations you can perform. I have found pandas to be pretty intelligent in how it assigns data types, but you'll want to check to be sure. To do this, call the dtypes attribute on your data frame. We see two data types represented in our data frame: float for all the measurement data, and object for the species. 


In [18]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
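
When a data type is not what you expected, astype is one way to convert it. The sketch below uses a made-up frame in which the measurements arrived as text; the frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical input where numbers were stored as strings.
raw = pd.DataFrame({
    "petal_width": ["0.2", "0.4", "1.8"],
    "species": ["setosa", "setosa", "virginica"],
})

# While the column holds text, arithmetic such as sum() would not
# give a numeric total, so convert it to floats first.
raw["petal_width"] = raw["petal_width"].astype(float)

# A column with few repeated values can be stored as a category.
raw["species"] = raw["species"].astype("category")

print(raw.dtypes)
print(raw["petal_width"].sum())
```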

``Subsetting your data with loc & iloc``

Often when using pandas, you'll want to subset your data. The loc indexer subsets your data based on labels, so either the row index labels or the column names. The iloc indexer subsets by integer position, so the row number or column order. 

So here, we're going to subset our data frame based on row labels three, four, and five, which are the fourth through sixth rows of our data. Note that indexing begins at zero and that, unlike an ordinary Python slice, a loc slice includes its end label. 

In [19]:
iris.loc[3:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
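
One detail worth seeing in code is how the same slice behaves under loc and iloc. This is a minimal sketch on a made-up single-column frame.

```python
import pandas as pd

# Seven rows with the default labels 0 through 6.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]})

by_label = df.loc[3:5]      # label slice: the end label 5 is included
by_position = df.iloc[3:5]  # position slice: the end position 5 is excluded

print(by_label["value"].tolist())
print(by_position["value"].tolist())
```

So the loc slice returns three rows while the iloc slice returns two, even though the brackets look identical.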


We can also return a single cell value by passing a row label and a column name separated by a comma. This returns 4.6, which is the sepal length measurement for the row at index three in our data frame. 

In [20]:
iris.loc[3, 'sepal_length']

np.float64(4.6)

Using iloc, we can return the same value by referencing row position three and column position zero. The positions match the labels here only because the data frame still has its default index.

In [None]:
iris.iloc[3,0]
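
Labels and positions stop lining up as soon as rows are filtered out. The sketch below uses five sepal_length values shown earlier in this lesson; the filter threshold and the variable names are assumptions for illustration.

```python
import pandas as pd

measurements = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "species": ["setosa"] * 5,
})

# Filtering keeps the original labels, so the result is labelled 1, 2, 3.
short = measurements[measurements["sepal_length"] < 5.0]
print(list(short.index))

by_label = short.loc[3, "sepal_length"]  # row labelled 3
by_position = short.iloc[2, 0]           # third row, first column

# Position 3 does not exist in a three-row frame.
try:
    short.iloc[3, 0]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_label, by_position, out_of_bounds)
```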

Often, after you've done a whole host of data transformations with pandas, you'll want to export your data frame for analysis or visualization. A handy way to do this is the to_csv method. Note that you may want to pass index=False so the index isn't written to your CSV. 

In [23]:
# save your file

iris.to_csv('iris-output.csv', index=False)
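
To see what index=False changes, this sketch writes a small made-up frame to CSV text rather than to a file (to_csv returns a string when no path is given) and compares the two header lines. The frame and variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "petal_width": [0.2, 1.8],
})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the first version back produces an extra "Unnamed: 0" column.
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```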

Great. This will have generated a CSV named iris-output.csv in your working directory. These functions are so useful in data analysis that, if you aren't using them currently, I highly recommend you give them a try.

**1.3. Configuring options using pandas**

Another pandas feature I want you to take advantage of is configuring your own options. pandas has an options system that allows you to customize how the package works for you. Most often, this is useful for changing how results are displayed. So here's an example. We'll start with this sample data frame of the top three nations in 2018 by carbon dioxide emissions. 

In [1]:
# import library

import pandas as pd

In [16]:
# pandas options dataset

emissions = pd.DataFrame({"country": ['China', 'United States', 'India'],
                          "year": ['2018', '2018', '2018'],
                          'co2_emissions': [10060000000.0, 5410000000.0, 2650000000.0]
                        })

In [3]:
emissions

Unnamed: 0,country,year,co2_emissions
0,China,2018,10060000000.0
1,United States,2018,5410000000.0
2,India,2018,2650000000.0


The first option that comes in handy sets the maximum number of rows to display for a pandas data frame. If we set max rows to two, we see two rows displayed, separated by an ellipsis. You can use this option either to limit the screen space your displayed data frames take up or, conversely, to raise the limit and reveal more of your data. 

Similarly, the max columns display option will reveal or hide columns. I find this most useful when viewing the head of a data frame that has a lot of columns, as pandas will truncate these by default. Applied to our data frame, it puts another ellipsis between the first and third columns. Another trick worth checking out is to suppress scientific notation when displaying floats. 

In [12]:
pd.set_option('display.max_rows', 2)
emissions

Unnamed: 0,country,year,co2_emissions
0,China,2018,1.006000e+10
...,...,...,...
2,India,2018,2.650000e+09
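
Options set with set_option stay in effect for the rest of the session. As a hedged sketch of how to inspect, reset, and temporarily override them, the snippet below uses get_option, reset_option, and option_context; the variable names are illustrative.

```python
import pandas as pd

# Change an option, read it back, then restore the pandas default.
pd.set_option("display.max_rows", 2)
changed = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")
before = pd.get_option("display.max_rows")

# option_context applies options only inside the with block.
with pd.option_context("display.max_rows", 2, "display.max_columns", 2):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(changed, inside, before == after)
```

Using option_context is a reasonable choice when you want a compact display for one cell without affecting every later output.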


You'll notice the scientific notation makes it difficult to readily compare carbon dioxide emissions figures in our table. By modifying the float format option, you can display these values in full and even add a comma as a thousands separator. So there you have it. Now you've got one more trick up your sleeve by customizing your options in pandas.

In [29]:
pd.options.display.float_format = '{:,.2f}'.format

In [30]:
emissions

Unnamed: 0,country,year,co2_emissions
0,China,2018,"10,060,000,000.00"
...,...,...,...
2,India,2018,"2,650,000,000.00"
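
For reference, the float format option expects a callable that turns a float into a string, and the lesson passes the format method of a Python format string. The sketch below shows that callable on its own and then applies it temporarily through option_context, so the session-wide setting is left untouched; the variable names are illustrative.

```python
import pandas as pd

emissions = pd.DataFrame({
    "country": ["China", "United States", "India"],
    "year": ["2018", "2018", "2018"],
    "co2_emissions": [10060000000.0, 5410000000.0, 2650000000.0],
})

# The formatter is an ordinary Python callable:
# "," adds thousands separators and ".2f" keeps two decimal places.
formatter = "{:,.2f}".format
print(formatter(10060000000.0))

# Apply the formatter only while rendering this one table.
with pd.option_context("display.float_format", formatter):
    formatted = emissions.to_string()

print(formatted)
```

To undo a session-wide change made with pd.options.display.float_format, pd.reset_option("display.float_format") restores the default.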
