## Getting familiar with Pandas

Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. 

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercise below you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:
* `names`, containing the country names for which data is available.
* `dr`, a list of booleans indicating whether people drive on the right (`True`) or on the left (`False`) in the corresponding country.
* `cpc`, the number of motor vehicles per 1000 people in the corresponding country.

To build the DataFrame, you first put these lists in a dictionary: each key is a column label and each value is a list containing that column's elements.

### Dictionary to DataFrame

In [22]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

In [23]:
# Import pandas as pd
import pandas as pd

In [24]:
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }

In [25]:
# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

In [26]:
# Print cars
print(cars)

   cars_per_cap        country drives_right
0           809  United States         True
1           731      Australia        False
2           588          Japan        False
3            18          India        False
4           200         Russia         True
5            70        Morocco         True
6            45          Egypt         True


Notice that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 to 6. To replace them with more meaningful labels, a list `row_labels` is defined below. You can use it to specify the row labels of the `cars` DataFrame by assigning it to the index attribute, which you access as `cars.index`.

In [27]:
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)

     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False
IN             18          India        False
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True
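
As an alternative to assigning `cars.index` after the fact, the row labels can also be supplied when the DataFrame is built, through the `index` argument of `pd.DataFrame()`. The sketch below assumes the same lists as above; the name `cars_labeled` is introduced here only for illustration.

```python
import pandas as pd

# Same lists as in the cells above
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Pass the row labels at construction time instead of setting cars.index later
# (cars_labeled is an illustrative name, not part of the original exercise)
cars_labeled = pd.DataFrame(
    {'country': names, 'drives_right': dr, 'cars_per_cap': cpc},
    index=row_labels,
)

print(cars_labeled)
```

Both approaches should give the same row labels; which one reads better depends on whether the labels are known when the DataFrame is created.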


### CSV to DataFrame

Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use `read_csv()`.

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file named `cars.csv`. In the cell below the file is read from a URL; if it were in your current working directory instead, the path to the file would simply be `'cars.csv'`.

In [28]:
url = "https://assets.datacamp.com/production/course_799/datasets/cars.csv"
# Import the cars.csv data: cars
cars = pd.read_csv(url)

# Print out cars
print(cars)

  Unnamed: 0  cars_per_cap        country drives_right
0         US           809  United States         True
1        AUS           731      Australia        False
2        JAP           588          Japan        False
3         IN            18          India        False
4         RU           200         Russia         True
5        MOR            70        Morocco         True
6         EG            45          Egypt         True


Your `read_csv()` call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.

This is where `index_col` comes in: it is an argument of `read_csv()` that specifies which column in the CSV file should be used as the row labels. Setting `index_col = 0` uses the first column, which is exactly what you need here.

In [29]:
# Fix import by including index_col
cars = pd.read_csv(url, index_col = 0)

# Print out cars
print(cars)

     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False
IN             18          India        False
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True
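
The effect of `index_col` can also be tried without a network connection. The sketch below is an illustration only: `csv_text` is a hypothetical in-memory stand-in for the first three rows of `cars.csv`, and it assumes the file's first header field is empty, which is what the `Unnamed: 0` column above suggests.

```python
import io
import pandas as pd

# Hypothetical stand-in for cars.csv (first three rows only).
# Assumption: the header's first field is empty, as the 'Unnamed: 0' output suggests.
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
"""

# Without index_col, the label column is imported as an ordinary column
default_import = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels
labeled_import = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_import.columns.tolist())
print(labeled_import.index.tolist())
```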


### Indexing and Selecting a Pandas DataFrame

You can index and select from Pandas DataFrames in many different ways. The simplest, though not the most powerful, is to use square brackets. The `cars` data was imported from a CSV file as a Pandas DataFrame. To select only the `cars_per_cap` column from `cars`, you can use either of the following:

cars['cars_per_cap']
cars[['cars_per_cap']]

The single bracket version gives a Pandas Series.
The double bracket version gives a Pandas DataFrame.

In [30]:
# Print out country column as Pandas Series
print(cars['country'])

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object


In [31]:
# Print out country column as Pandas DataFrame
print(cars[['country']])

           country
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt


Notice the difference between the two calls above. The first returns a Pandas Series, whereas the second returns a Pandas DataFrame.
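
One way to confirm the difference is to inspect the type and shape of each result. The sketch below uses a small stand-in DataFrame, `cars_demo`, which is a name made up for this illustration.

```python
import pandas as pd

# Small stand-in for the cars data (cars_demo is an illustrative name)
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan']},
    index=['US', 'AUS', 'JAP'],
)

as_series = cars_demo['country']     # single brackets: one column as a Series
as_frame = cars_demo[['country']]    # double brackets: a one-column DataFrame

print(type(as_series).__name__)         # Series
print(type(as_frame).__name__)          # DataFrame
print(as_series.shape, as_frame.shape)  # (3,) (3, 1)
```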

In [32]:
# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])

           country drives_right
US   United States         True
AUS      Australia        False
JAP          Japan        False
IN           India        False
RU          Russia         True
MOR        Morocco         True
EG           Egypt         True


Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

cars[0:5]

The result is another `DataFrame` containing only the rows you specified.

Pay attention: You can only select rows using square brackets if you specify a slice, like `0:4`. Also, you're using the integer positions of the rows here, not the row labels, and the end position of the slice is excluded.

In [33]:
# Print out first 3 observations
print(cars[0:3])

     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False


Note: Even though you are using single brackets, the result is a DataFrame and not a Series, because you are selecting rows with a slice. A Pandas Series represents a single column, and a DataFrame can be thought of as a collection of Series that share the same index. We will learn more about the advantages of using a Series object later.
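
The slice-only rule for rows can be checked with a short sketch. It assumes a small stand-in DataFrame, `cars_demo` (an illustrative name): a positional slice returns rows, while a bare row label inside square brackets is looked up among the columns and raises a `KeyError`.

```python
import pandas as pd

# Small stand-in for the cars data (cars_demo is an illustrative name)
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18],
     'country': ['United States', 'Australia', 'Japan', 'India']},
    index=['US', 'AUS', 'JAP', 'IN'],
)

# Positional slice: rows at positions 0 and 1 (the end position is excluded)
first_two = cars_demo[0:2]
print(type(first_two).__name__)

# A bare label in square brackets is treated as a column label,
# so asking for the row label 'US' this way fails
try:
    cars_demo['US']
except KeyError:
    label_lookup_failed = True
else:
    label_lookup_failed = False

print(label_lookup_failed)
```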

### Use of `loc` and `iloc`

With `loc` and `iloc` you can do practically any data selection operation on DataFrames you can think of.

`loc` is label-based, which means that you have to specify rows and columns based on their row and column labels.

`iloc` is integer-position based, so you have to specify rows and columns by their integer position, as you did with the row slices in the previous exercise.
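
To make the contrast concrete, the sketch below selects the same row once by label and once by position. It uses a three-row stand-in DataFrame, `cars_demo` (an illustrative name), in which Japan is the third row.

```python
import pandas as pd

# Three-row stand-in for the cars data (cars_demo is an illustrative name)
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'],
)

by_label = cars_demo.loc[['JAP']]    # row whose label is 'JAP'
by_position = cars_demo.iloc[[2]]    # third row (positions start at 0)

# Both selections return the same one-row DataFrame
print(by_label.equals(by_position))
```

Note that the position passed to `iloc` depends on the current row order, whereas the label passed to `loc` does not.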


In [34]:
# Print out observation for Japan
print(cars.loc[['JAP']])

     cars_per_cap country drives_right
JAP           588   Japan        False


In [35]:
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])

     cars_per_cap    country drives_right
AUS           731  Australia        False
EG             45      Egypt         True


`loc` and `iloc` also allow you to select both rows and columns from a DataFrame. 

In [36]:
# Print out drives_right value of Morocco
print(cars.loc[['MOR'], ['drives_right']])

    drives_right
MOR         True


In [37]:
# Print sub-DataFrame
print(cars.loc[['RU', 'MOR'], ['country', 'drives_right']])

     country drives_right
RU    Russia         True
MOR  Morocco         True
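
The same kind of sub-DataFrame can be selected with `iloc` by giving row and column positions instead of labels. The sketch below is an illustration on a stand-in DataFrame, `cars_demo` (a made-up name), where Russia and Morocco sit at row positions 1 and 2 and the wanted columns at column positions 1 and 2.

```python
import pandas as pd

# Stand-in for the cars data (cars_demo is an illustrative name)
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 200, 70],
     'country': ['United States', 'Russia', 'Morocco'],
     'drives_right': [True, True, True]},
    index=['US', 'RU', 'MOR'],
)

# Label-based: row labels first, column labels second
by_label = cars_demo.loc[['RU', 'MOR'], ['country', 'drives_right']]

# Position-based: row positions first, column positions second
by_position = cars_demo.iloc[[1, 2], [1, 2]]

print(by_label.equals(by_position))
```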


It's also possible to select only columns with `loc` and `iloc`. In both cases, you simply put a slice going from beginning to end (`:`) in front of the comma:

In [38]:
# Print out drives_right column as Series
print(cars.loc[:,'drives_right'])

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool


In [39]:
# Print out drives_right column as DataFrame
print(cars.loc[:,['drives_right']])

    drives_right
US          True
AUS        False
JAP        False
IN         False
RU          True
MOR         True
EG          True


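NOTE: It is important to distinguish between the two uses of square brackets when retrieving columns. Passing a single label, as in `cars.loc[:, 'drives_right']`, returns a Pandas Series, whereas passing a list of labels, as in `cars.loc[:, ['drives_right']]`, returns a Pandas DataFrame.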

In [40]:
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:,['cars_per_cap', 'drives_right']])

     cars_per_cap drives_right
US            809         True
AUS           731        False
JAP           588        False
IN             18        False
RU            200         True
MOR            70         True
EG             45         True
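
For completeness, here is a sketch of the `iloc` counterparts of the column selections above. It assumes a stand-in DataFrame, `cars_demo` (an illustrative name), whose columns are in the order `cars_per_cap`, `country`, `drives_right`, so that `drives_right` is at column position 2.

```python
import pandas as pd

# Stand-in for the cars data (cars_demo is an illustrative name)
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'],
)

col_series = cars_demo.iloc[:, 2]       # single position: Series
col_frame = cars_demo.iloc[:, [2]]      # list with one position: DataFrame
two_cols = cars_demo.iloc[:, [0, 2]]    # list with two positions: DataFrame

print(type(col_series).__name__)
print(type(col_frame).__name__)
print(two_cols.columns.tolist())
```

As with `loc`, a single position gives a Series and a list of positions gives a DataFrame.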
