# Using Kaggle Learn Pandas 

**Chapter "Creating, Reading and Writing"**

In [37]:
import pandas as pd
import os

## Creating data

There are two core objects in pandas:

1. the **DataFrame**
2. the **Series**

### DataFrame ###

A DataFrame is a table. It contains an array of individual *entries*, each of which has a certain *value*. Each entry corresponds to a row (or *record*) and a *column*.

For example, consider the following simple DataFrame:

In [26]:
pd.DataFrame({'Yes': [50, 21], 'No':[131, 2]})

Unnamed: 0,Yes,No
0,50,131
1,21,2


In this example, the "0, No" entry has the value of 131. The "0, Yes" entry has a value of 50, and so on. 

DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:

In [27]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good', 'Bland']})

Unnamed: 0,Bob,Sue
0,I liked it.,Pretty good
1,It was awful.,Bland


We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the *column labels*, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the *row labels*. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using an index parameter in our constructor:

In [28]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
             'Sue': ['Pretty good.', 'Bland.']},
             index=['Mofongo al ajillo', 'chuletas Kan Kan'])

Unnamed: 0,Bob,Sue
Mofongo al ajillo,I liked it.,Pretty good.
chuletas Kan Kan,It was awful.,Bland.


### Series ###

A *Series* is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

In [29]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Mofongo al ajillo')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Mofongo al ajillo, dtype: int64

The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.
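As a rough illustration of the "glued together" idea, the sketch below builds two Series that share an index and joins them side by side with pd.concat. The names apples, bananas, glued and column, and the sales figures, are made up for this example.

```python
import pandas as pd

# Two Series sharing the same row labels (hypothetical sales figures).
apples = pd.Series([30, 35], index=['2015 Sales', '2016 Sales'], name='Apples')
bananas = pd.Series([21, 34], index=['2015 Sales', '2016 Sales'], name='Bananas')

# Glue the Series together side by side; each Series name becomes a column label.
glued = pd.concat([apples, bananas], axis=1)
print(glued)

# Pulling one column back out gives a Series again.
column = glued['Apples']
print(type(column))
```

Going in the other direction, selecting a column from the DataFrame hands back a Series carrying the same row labels.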


### Reading data files ###

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.


Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

    Product A, Product B, Product C,
    30, 21, 9,
    35, 34, 1,
    41, 11, 11
 
So a CSV file is a table of values separated by commas. Hence the name: "Comma Separated Values", or CSV.
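To see how such text turns into a table, here is a small sketch that feeds the same values to pd.read_csv() from an in-memory string rather than a file. The variable names csv_text and products are illustrative, and the trailing commas and extra spaces are left out for simplicity.

```python
import io
import pandas as pd

# The tiny table from above as CSV text (trailing commas omitted).
csv_text = "Product A,Product B,Product C\n30,21,9\n35,34,1\n41,11,11\n"

# io.StringIO lets read_csv treat a plain string like an open file.
products = pd.read_csv(io.StringIO(csv_text))
print(products)
```

The first line of the text becomes the column labels, and every later line becomes one row.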
 
Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. This goes thusly:


In [41]:
wine_reviews = pd.read_csv('/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/Wine/winemag-data-130k-v2.csv')

We can use the *shape* attribute to check how large the resulting DataFrame is:

In [43]:
wine_reviews.shape

(129971, 14)

So our new DataFrame has almost 130,000 records split across 14 different columns. That's roughly 1.8 million entries!
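The entry count is simply rows times columns. The sketch below checks that on a tiny stand-in DataFrame (not the wine data) and then repeats the arithmetic for the shape reported above; small and total_entries are names chosen for this example.

```python
import pandas as pd

# A tiny stand-in DataFrame: 2 rows by 2 columns.
small = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

n_rows, n_cols = small.shape      # shape is a (rows, columns) tuple
print(n_rows * n_cols)            # 4
print(small.size)                 # 4, the same count computed by pandas

# The same arithmetic for the shape reported above, (129971, 14).
total_entries = 129971 * 14
print(total_entries)              # 1819594
```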

We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:

In [44]:
wine_reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


The pd.read_csv() function is feature-rich, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.

In [45]:
wine_reviews = pd.read_csv('/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/Wine/winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
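The effect of index_col can be reproduced on a small scale. The sketch below writes a two-row DataFrame to CSV text and reads it back with and without index_col=0. The names original, csv_text, without_index_col and with_index_col are illustrative, and the data is a made-up stand-in for the wine file.

```python
import io
import pandas as pd

# A small DataFrame with meaningful row labels.
original = pd.DataFrame({'Cows': [12, 20]}, index=['Year 1', 'Year 2'])

# With no path argument, to_csv() returns the CSV text as a string.
# The index is written as the first column, with an empty header cell.
csv_text = original.to_csv()

# Default read: the saved index comes back as an ordinary column.
without_index_col = pd.read_csv(io.StringIO(csv_text))
print(without_index_col.columns.tolist())   # ['Unnamed: 0', 'Cows']

# With index_col=0 the first column is used as the row labels instead.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index_col.index.tolist())        # ['Year 1', 'Year 2']
```

This is also where a column labelled "Unnamed: 0" comes from: it is a saved index whose header cell was empty.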


# MY TURN

## Exercises

**1** 

In the cell below, create a DataFrame fruits

In [48]:
# Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruits.
fruits = pd.DataFrame({'Apples': [30], 'Bananas': [21]})

# Check your answer
fruits

Unnamed: 0,Apples,Bananas
0,30,21


**2**

Create a dataframe fruit_sales below:

In [50]:
fruit_sales = pd.DataFrame({'Apples': [35, 41], 'Bananas': [21, 34]}, index=['2017 Sales', '2018 Sales'])

fruit_sales.head()

Unnamed: 0,Apples,Bananas
2017 Sales,35,21
2018 Sales,41,34


**3**

Create a variable ingredients with a Series that looks like: 

Flour      4 cups
Milk       1 cup
Eggs       2 large
Spam       1 can 
Name: Dinner, dtype: object


In [55]:
import pandas as pd

ingredients = pd.Series(['4 cups', '1 cup', '2 large', '1 can'], index=['Flour', 'Milk', 'Eggs', 'Spam'], name='Dinner')

print(ingredients)


Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object


**4**

Read the following CSV dataset of wine reviews into a DataFrame called *reviews*:

In [87]:
reviews = pd.read_csv('/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/Wine/winemag-data_first150k.csv', index_col=0)

In [88]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


**5**

Run the cell below to create and display a DataFrame called *animals*:

In [68]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals

Unnamed: 0,Cows,Goats
Year 1,12,22
Year 2,20,19


## Indexing, Selecting & Assigning 

Pro data scientists do this dozens of times a day. You can, too!

### Introduction

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to select the data points relevant to you quickly and effectively.

### **Native Accessor** 


Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

Consider this DataFrame:

In [69]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


In Python, we can access the property of an object by accessing it as an attribute. A *book* object, for example, might have a *title* property, which we can access by calling *book.title*. Columns in a pandas DataFrame work in much the same way. 

Hence to access the *country* property of *reviews* we can use:

In [70]:
reviews.country

0             US
1          Spain
2             US
3             US
4         France
           ...  
150925     Italy
150926    France
150927     Italy
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

If we have a Python dictionary, we can access its values using the indexing *([])* operator. We can do the same with columns in a DataFrame:

In [71]:
reviews['country']

0             US
1          Spain
2             US
3             US
4         France
           ...  
150925     Italy
150926    France
150927     Italy
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a *country providence* column, reviews.country providence wouldn't work).
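A minimal sketch of that difference, using a made-up two-row DataFrame called wines that has a hypothetical *country providence* column (the real reviews data has no such column):

```python
import pandas as pd

# Hypothetical data: one column name contains a space.
wines = pd.DataFrame({'country': ['US', 'Spain'],
                      'country providence': ['California', 'Northern Spain']})

# The indexing operator accepts any column label, spaces included.
by_brackets = wines['country providence']
print(by_brackets.tolist())

# Attribute access works only for names that are valid Python identifiers.
by_attribute = wines.country
print(by_attribute.tolist())

# Writing wines.country providence is not valid Python syntax,
# so that line is left out of this sketch.
```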

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:

In [73]:
reviews['country'][0]

'US'

## Indexing in pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, *loc* and *iloc*. For more advanced operations, these are the ones you're supposed to be using.

####  Index-based selection 

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. *iloc* follows this paradigm.
    
To select the first row of data in a DataFrame, we may use the following:

In [74]:
reviews.iloc[0]

country                                                       US
description    This tremendous 100% varietal wine hails from ...
designation                                    Martha's Vineyard
points                                                        96
price                                                      235.0
province                                              California
region_1                                             Napa Valley
region_2                                                    Napa
variety                                       Cabernet Sauvignon
winery                                                     Heitz
Name: 0, dtype: object

Both *loc* and *iloc* are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with *iloc*, we can do the following:

In [78]:
reviews.iloc[:, 0]

0             US
1          Spain
2             US
3             US
4         France
           ...  
150925     Italy
150926    France
150927     Italy
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object
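To make the row-first ordering concrete, the sketch below reads the same cell three ways from the small animals DataFrame used in the exercises. The variable names native, by_position and by_label are illustrative.

```python
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

# Native-style access: pick the column first, then the row label.
native = animals['Goats']['Year 2']

# iloc: row position first, then column position.
by_position = animals.iloc[1, 1]

# loc: row label first, then column label.
by_label = animals.loc['Year 2', 'Goats']

print(native, by_position, by_label)   # 19 19 19
```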

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the *country* column from just the first, second, and third rows, we would do:

In [79]:
reviews.iloc[:3, 0]

0       US
1    Spain
2       US
Name: country, dtype: object

Or, to select just the second and third entries, we would do:

In [80]:
reviews.iloc[1:3, 0]

1    Spain
2       US
Name: country, dtype: object

It's also possible to pass a list:

In [81]:
reviews.iloc[[0, 1, 2], 0]

0       US
1    Spain
2       US
Name: country, dtype: object

Finally, it's worth knowing that negative numbers can be used in selection. These count backward from the end of the values. So, for example, here are the last five rows of the dataset:

In [82]:
reviews.iloc[-5:]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder
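The slice, list, and negative-number forms are easier to compare on a dataset small enough to read at a glance. The sketch below uses a made-up six-row DataFrame called letters; all names in it are illustrative.

```python
import pandas as pd

# Six rows, one column, default 0..5 index.
letters = pd.DataFrame({'letter': list('abcdef')})

first_three = letters.iloc[:3, 0].tolist()       # positions 0, 1, 2
second_third = letters.iloc[1:3, 0].tolist()     # positions 1, 2 (end is excluded)
picked = letters.iloc[[0, 2, 4], 0].tolist()     # an explicit list of positions
last_two = letters.iloc[-2:, 0].tolist()         # counted from the end

print(first_three)    # ['a', 'b', 'c']
print(second_third)   # ['b', 'c']
print(picked)         # ['a', 'c', 'e']
print(last_two)       # ['e', 'f']
```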





#### Label-based selection

The second paradigm for attribute selection is the one followed by the *loc* operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.


For example, to get the first entry in reviews, we would now do the following:

In [83]:
reviews.loc[0, 'country']

'US'
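In reviews the row labels happen to equal the row positions, which hides the difference between the two accessors. The sketch below uses a made-up Series called scores whose labels are 10, 20 and 30, so that label and position no longer coincide.

```python
import pandas as pd

# Labels (10, 20, 30) deliberately differ from positions (0, 1, 2).
scores = pd.Series([96, 95, 90], index=[10, 20, 30], name='points')

by_position = scores.iloc[0]   # first element, whatever its label is
by_label = scores.loc[10]      # element whose label is 10

print(by_position, by_label)   # 96 96

# There is no label 0 here, so scores.loc[0] would raise a KeyError.
print(0 in scores.index)       # False
```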

*iloc* is conceptually simpler than *loc* because it ignores the dataset's indices. When we use *iloc* we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. *loc*, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using *loc* instead. For example, here's one operation that's much easier using *loc*:

In [84]:
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

KeyError: "['taster_name', 'taster_twitter_handle'] not in index"

In [85]:
print(reviews.columns)

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')
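The KeyError arises because *reviews* was read from winemag-data_first150k.csv, which has no *taster_name* or *taster_twitter_handle* columns. Those columns appear only in the *wine_reviews* DataFrame loaded earlier from winemag-data-130k-v2.csv. One possible way to guard against this, sketched on a small stand-in DataFrame (sample, wanted, available and subset are illustrative names), is to keep only the requested labels that actually exist before calling *loc*:

```python
import pandas as pd

# Stand-in with a few of the reviews columns; values copied from the first rows shown above.
sample = pd.DataFrame({'country': ['US', 'Spain'],
                       'points': [96, 96],
                       'price': [235.0, 110.0]})

wanted = ['taster_name', 'taster_twitter_handle', 'points']

# Keep only the requested labels that are present in the columns.
available = [col for col in wanted if col in sample.columns]
print(available)            # ['points']

# loc with a list of existing labels succeeds.
subset = sample.loc[:, available]
print(subset)
```

Whether to skip missing columns silently or stop with an error depends on the task; this sketch simply shows the check.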
