# Lesson 2: Intro to Python Part 2

Today we want to learn more about visualization, dictionaries, pandas and control structures.

## Visualization: Matplotlib
- Data visualizations are important because they allow us to understand data
- The better you understand data, the more insights you will find
- The better the visualization, the easier it will be to communicate your findings
- Matplotlib is one of the most important visualization packages in Python

### Line Plot
You can create line plots to represent data. The example below shows how to use matplotlib to create a basic line plot from two lists of data.
   - The plot() function tells Python what to plot and how to plot it
   



In [4]:
# Import matplotlib subpackage pyplot as plt
import matplotlib.pyplot as plt

# Data of the US population in millions over the past 3 decades
year = [1990, 2000, 2010, 2020]
population = [248.7, 281.3, 308.7, 332.6]

# Create a plot using the data provided 
# Year on the horizontal axis and Population on the vertical axis
# The plot function tells Python what to plot and how to plot it



# Display the plot
# You can call the show() function to display the plot
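
One possible way to complete the cell above, as a minimal sketch. It uses only the year and population lists from the lesson; the variable name lines is an assumption added for illustration.

```python
import matplotlib.pyplot as plt

# Data of the US population in millions
year = [1990, 2000, 2010, 2020]
population = [248.7, 281.3, 308.7, 332.6]

# Year on the horizontal axis, population on the vertical axis
lines = plt.plot(year, population)

# Display the plot
plt.show()
```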



Let's think about a more truthful way to visualize this data: a scatter plot...
  - Why is a scatter plot a more truthful representation of our data?

In [3]:
# Create a scatter plot from the data provided

# Show the plot
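
A minimal sketch of the scatter version, reusing the same two lists. One possible answer to the question above: we only have four measurements, and a line plot suggests that we also know the values in between. The variable name points is an assumption added for illustration.

```python
import matplotlib.pyplot as plt

year = [1990, 2000, 2010, 2020]
population = [248.7, 281.3, 308.7, 332.6]

# Draw one marker per (year, population) pair, with no connecting line
points = plt.scatter(year, population)

# Show the plot
plt.show()
```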


### Customization
- There are many options available to customize your plots:
    - plot type 
    - Color
    - Label
    - Axis
    
- The choices depend on the data available and the story you want to tell

- Important steps to consider:
        1) Pick the type
        2) Color and size
        3) Label your axes
        4) Modify the scale of each axis

In [2]:
# Import matplotlib subpackage pyplot as plt
# Note we do not really have to do this twice
import matplotlib.pyplot as plt

# Given data
time = [90,92,94,96]
quantity = [248.7, 281.3, 308.7, 332.6]

# Pick the type of graph you would like to make (here a scatter plot: x values, y values, marker color and marker size)


# Label your axes


# Add a title


# We can modify the axis ticks on our x and/or y axis


# Show your graph
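
The four steps above could be sketched as follows. The color, marker size, axis labels and title are hypothetical choices, since the lesson does not say what time and quantity represent.

```python
import matplotlib.pyplot as plt

# Given data
time = [90, 92, 94, 96]
quantity = [248.7, 281.3, 308.7, 332.6]

# Steps 1 and 2: scatter plot with a marker color (c) and marker size (s)
plt.scatter(time, quantity, c='red', s=60)

# Step 3: label the axes and add a title (hypothetical text)
plt.xlabel('Time')
plt.ylabel('Quantity')
plt.title('Quantity over time')

# Step 4: choose where the ticks appear on the x axis
plt.xticks([90, 92, 94, 96])

# Keep a handle on the current axes before showing the graph
ax = plt.gca()
plt.show()
```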


## Dictionaries and Pandas
### Dictionaries 
- Dictionaries are used to store data values in key:value pairs
- This data structure is very useful because you can use the keys to index data
- To create a dictionary you first open a set of curly brackets
- Inside the curly brackets we have what we call Key Value pairs
- Keys: Elements that you want to define
- Values: Elements that define your keys (these can be of any type: float, integer, list, ...)
- Keys and values are separated by a colon (:)

In [30]:
# Example dictionary of fruit weights in grams
fruit_dict = {'apple':195 , 'banana':120, 'mango':200}

print(fruit_dict)

{'apple': 195, 'banana': 120, 'mango': 200}


***Accessing Elements From Dictionary***
- With lists, we used indexing to access values within a list with a set of square brackets
- Similarly, for dictionaries we can add square brackets to find any particular element within the dictionary
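
As a small sketch of key-based access, using the fruit dictionary from the lesson (the variable name banana_weight is added for illustration):

```python
# Example dictionary of fruit weights in grams
fruit_dict = {'apple': 195, 'banana': 120, 'mango': 200}

# Look up a value by putting its key inside square brackets
banana_weight = fruit_dict['banana']
print(banana_weight)  # prints 120
```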


In [31]:
# Access the weight of the banana from the fruit dictionary


120


In [32]:
# EXERCISE 1
# Print the weight of mangos from the fruit dictionary

- The keys in the dictionary must be unique
- Keys have to be immutable objects: strings, booleans and integers are immutable, while lists, for example, are not
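
The two rules above can be checked with a short experiment. This is only a sketch; the dictionaries and variable names here are made up for illustration.

```python
# If the same key is written twice, only the last value is kept
dup = {'apple': 195, 'apple': 200}
print(dup)  # prints {'apple': 200}

# Strings, integers, booleans and tuples are immutable, so they work as keys
valid = {'apple': 195, 7: 'seven', True: 'yes', (1, 2): 'a tuple'}

# A list is mutable, so using one as a key raises a TypeError
try:
    invalid = {['apple', 'banana']: 315}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(list_key_allowed)  # prints False
```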

***Adding/Updating Data to a Dictionary***

You can add new keys and their corresponding value to a dictionary
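
A minimal sketch of adding a key and checking for it, using the orange weight shown in the lesson output:

```python
fruit_dict = {'apple': 195, 'banana': 120, 'mango': 200}

# Assigning to a key that does not exist yet adds the pair to the dictionary
fruit_dict['orange'] = 140
print(fruit_dict)

# "in" checks whether a key is present
print('orange' in fruit_dict)  # prints True
```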

In [33]:
# Add orange and its corresponding weight to the dictionary


# Print the dictionary


# You can check if any particular key is in a dictionary
# Using "in"


{'apple': 195, 'banana': 120, 'mango': 200, 'orange': 140}
True


You can also update the values inside a dictionary
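
Updating uses the same assignment syntax as adding a key. A short sketch with the values from the lesson output:

```python
fruit_dict = {'apple': 195, 'banana': 120, 'mango': 200, 'orange': 140}

# Assigning to an existing key overwrites its value
fruit_dict['apple'] = 200
print(fruit_dict['apple'])  # prints 200
```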

In [34]:
# Print the current fruit dictionary


# Update the apple value


# Print the updated fruit dictionary


{'apple': 195, 'banana': 120, 'mango': 200, 'orange': 140}
{'apple': 200, 'banana': 120, 'mango': 200, 'orange': 140}


- You may remove keys from dictionaries

In [35]:
# Remove mango key from a dictionary
del fruit_dict['mango']

# Print fruit dictionary


{'apple': 200, 'banana': 120, 'orange': 140}


In [36]:
# EXERCISE 2
# Add berries with a weight of 4g to the fruit dictionary and print the weight of berries and apples

- Lists are indexed by numbers and dictionaries are indexed by keys
- Just like we did with sublists, you may have dictionaries where the value for each key is another dictionary
- You can then access data values within the nested dictionaries

In [6]:
# Define a dictionary with nested dictionary as values
team_dict = {1: {'name': 'John', 'age': '27', 'gender': 'Male'},
          2: {'name': 'Marie', 'age': '22', 'gender': 'Female'},
          3: {'name': 'Luna', 'age': '24', 'gender': 'Female', 'married': 'No'},
          4: {'name': 'Peter', 'age': '29', 'gender': 'Male', 'married': 'Yes'}}

# Extract the age of person 4
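
One way to read a value out of a nested dictionary is to chain two sets of square brackets. The sketch below keeps only two entries of team_dict to stay short; the variable name age_of_person_4 is added for illustration. Note that the ages in this dictionary are stored as strings.

```python
team_dict = {1: {'name': 'John', 'age': '27', 'gender': 'Male'},
             4: {'name': 'Peter', 'age': '29', 'gender': 'Male', 'married': 'Yes'}}

# The first key selects the inner dictionary, the second key selects a value inside it
age_of_person_4 = team_dict[4]['age']
print(age_of_person_4)  # prints 29
```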



### Pandas and DataFrames
- In data science we will work with large amounts of data
- To work with data in Python we need some sort of tabular data structure
- Pandas allows us to work with tabular data very easily by using DataFrames
- Pandas DataFrames are two-dimensional, table-like data structures

***Creating DataFrames***
- First you need to import the Pandas package
- Then you can create a DataFrame from a dictionary and other data sources such as the cloud or a CSV file from your computer


In [5]:
# Import the Pandas package
import pandas as pd

# Create a DataFrame from the team dictionary


# Print the DataFrame 



- This data looks a bit messy, what do the columns mean?
- We will work on how to clean and structure data correctly

For now let's look at data that does not need any editing...


In [7]:
# Import data from a CSV file
# The cars.csv file must be stored on your computer


# Print the imported DataFrame


- Notice that read_csv automatically generated an index column on the left of the DataFrame
- In most cases you will want the first column of your data to be the index column
- In our example, we have data corresponding to different countries, therefore the country abbreviations should be the row labels
- To do so, you can pass an argument to read_csv that specifies that the first column should be the index column
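
The effect of the index column argument can be sketched without the real file. The CSV text below is a made-up stand-in for cars.csv (placeholder values, not the actual data), read through io.StringIO so the example is self-contained.

```python
import io
import pandas as pd

# Placeholder CSV text standing in for cars.csv (values are made up)
csv_text = """,cars_per_cap,country
US,800,United States
RU,200,Russia
JPN,600,Japan
"""

# Default: pandas adds a 0, 1, 2, ... index and treats the first column as data
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row labels
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.index.tolist())  # prints [0, 1, 2]
print(cars.index.tolist())        # prints ['US', 'RU', 'JPN']
```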


In [8]:
# Import the CSV file with the first column as the indexing column
# you can do this by setting the index_col argument to zero
cars = pd.read_csv('cars.csv', index_col = 0)

# print DataFrame



### Select Data From DataFrames
- We must often index and select data from our data frame
- Some of the options we have to do this include:
    - Brackets
    - loc
    - iloc
    
You can select a column in a DataFrame using square brackets:

In [9]:
# Print the cars per capita column from the cars DataFrame


In [42]:
# EXERCISE 3
# print the full country names in the cars DataFrame


- Python prints the column along with the row labels
- The column was returned with a note at the bottom, stating that the dtype is int64
- We can then evaluate what type of data was returned from the square brackets using the type function type()

In [43]:
type(cars['cars_per_cap'])

pandas.core.series.Series

- Note that you are returned a series
- A Series is a one-dimensional labeled array, like a single column of a DataFrame
- You can think of a DataFrame as a bunch of series put together
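
A short sketch of the difference between a Series and a DataFrame. The cars DataFrame here is a small stand-in with placeholder values, not the contents of cars.csv. Passing a list of column names (double square brackets) is a standard pandas way to get a DataFrame back instead of a Series.

```python
import pandas as pd

# Stand-in for the cars DataFrame (placeholder values)
cars = pd.DataFrame({'cars_per_cap': [800, 200, 600],
                     'country': ['United States', 'Russia', 'Japan']},
                    index=['US', 'RU', 'JPN'])

single = cars['cars_per_cap']    # single brackets return a Series
double = cars[['cars_per_cap']]  # a list of column names returns a DataFrame

print(type(single))
print(type(double))
```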

***Slicing***
- You may obtain a set of rows from the DataFrame using slicing (similar to lists)
- Note the formatting of slicing
    - DataFrame[start:end] (start is inclusive, end is exclusive)
    - zero-based indexing
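
As a sketch of row slicing, again on a small stand-in DataFrame with placeholder values (not the real cars.csv):

```python
import pandas as pd

# Stand-in for the cars DataFrame (placeholder values)
cars = pd.DataFrame({'cars_per_cap': [800, 200, 600, 20],
                     'country': ['United States', 'Russia', 'Japan', 'India']},
                    index=['US', 'RU', 'JPN', 'IN'])

# Rows at positions 1 and 2 (position 1 is included, position 3 is not)
second_and_third = cars[1:3]
print(second_and_third)
```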

In [10]:
# Obtain the 2nd and 3rd rows from the cars DataFrame (1 is inclusive, 3 is exclusive)


In [45]:
# EXERCISE 4
# Obtain the last two rows from the cars DataFrame

***loc and iloc***
- loc: allows you to select data based on labels
- iloc: allows you to select your data based on position

***Example using loc***

In [11]:
# get the row for Russia from the cars DataFrame


- You get a pandas series containing all the information for that label (Russia in this case)
- If you wanted to get a DataFrame from your cars DataFrame that also included Japan, you could use another set of brackets...

In [12]:
# Select a DataFrame from the cars DataFrame that contains Russia and Japan


- You could also obtain specific information from that label 
- For example, if you wanted to obtain cars_per_cap and the full country name, you can add a comma and another list with the column names that you want

In [13]:
# Select the cars_per_cap and country columns for Russia and Japan


- You can select all rows and a set of columns by using a colon (:) instead of specifying what row label you want


In [14]:
# Select all rows and the cars_per_cap and country columns
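
The loc selections described in this section could look like the sketch below. The DataFrame and its row labels ('US', 'RU', 'JPN') are placeholders chosen for illustration; the labels in the real cars.csv may differ.

```python
import pandas as pd

# Stand-in for the cars DataFrame (placeholder values and labels)
cars = pd.DataFrame({'cars_per_cap': [800, 200, 600],
                     'country': ['United States', 'Russia', 'Japan']},
                    index=['US', 'RU', 'JPN'])

# One label returns a Series
russia_series = cars.loc['RU']

# A list of labels returns a DataFrame
two_rows = cars.loc[['RU', 'JPN']]

# Row labels, then a comma, then column names
subset = cars.loc[['RU', 'JPN'], ['cars_per_cap', 'country']]

# A colon in the row position selects every row
all_rows = cars.loc[:, ['cars_per_cap', 'country']]

print(subset)
```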


***Example Using iloc***
- You can also select several rows and columns based on index location by using iloc

In [15]:
# get the row for Russia from the cars DataFrame


In [16]:
# Select a DataFrame from the cars DataFrame that contains Russia and Japan


In [17]:
# Select a DataFrame from the cars DataFrame that contains India and Morocco


In [18]:
# Select the cars_per_cap and country columns for Russia and Japan


In [19]:
# Select all rows and the cars_per_cap and country columns
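
The same selections can be sketched with iloc, which takes integer positions instead of labels. The DataFrame is a stand-in with placeholder values, so the positions used here (Russia at row 1, Japan at row 2) are assumptions that may not match the real cars.csv.

```python
import pandas as pd

# Stand-in for the cars DataFrame (placeholder values and labels)
cars = pd.DataFrame({'cars_per_cap': [800, 200, 600],
                     'country': ['United States', 'Russia', 'Japan']},
                    index=['US', 'RU', 'JPN'])

# One position returns a Series
russia_series = cars.iloc[1]

# A list of positions returns a DataFrame
two_rows = cars.iloc[[1, 2]]

# Row positions, then a comma, then column positions
subset = cars.iloc[[1, 2], [0, 1]]

# A colon in the row position selects every row
all_rows = cars.iloc[:, [0, 1]]

print(subset)
```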


## Logic, Control Flow and Filtering
### Comparison Operators
- Comparison operators tell us how two Python values relate and return the result as a boolean (True or False)
- "<" is defined as less than
- ">" is defined as greater than
- Trick: if you know the game Pac-Man, imagine the comparison operator is Pac-Man's mouth. Pac-Man will always want to eat the biggest value.
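
A few comparisons as a quick sketch. The variables x and y are hypothetical values chosen for illustration.

```python
# Is 2 greater than 3?
print(2 > 3)   # prints False

# Is 2 equal to 4?
print(2 == 4)  # prints False

# Is 2 less than or equal to 3?
print(2 <= 3)  # prints True

# Comparisons also work on variables
x = 2
y = 5
print(x < y)   # prints True
```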

In [20]:
# Simple comparison operator, is 2 more than 3?


In [21]:
# You can also check if two values are equal, is 2 equal to 4?


In [22]:
# The same can be done with the less than or equal to operator (<=), is 2 less than or equal to 3?


- You can also use comparison operators directly on variables

In [23]:
# Define variables and test their values using comparison operators


In [24]:
# Note that '==' checks whether two values are equal, while '=' assigns a value to a variable


### Filtering
- You may apply comparison operators to numpy arrays to filter for the values that you need
- Let's filter out (remove) all values that are less than 13
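
A minimal sketch of this filter, using the array from the lesson. The names keep and filtered are added for illustration.

```python
import numpy as np

new_array = np.array([1, 23, 45, 12, 34, 44, 55, 2, 4])

# Comparing an array to a number gives one True/False per element
keep = new_array >= 13

# Indexing with the boolean array keeps only the elements marked True
filtered = new_array[keep]
print(filtered)  # prints [23 45 34 44 55]
```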

In [25]:
# Import numpy
import numpy as np

# Define numpy array 
new_array = np.array([1,23,45,12,34,44,55,2,4])

# Create a filtering boolean array 



In [26]:
# Using that boolean array we may now extract the values we want


# print


### Boolean Operators
- You may combine comparison operations with boolean operators
- The main boolean operators are:
    - **and**: checks two booleans and returns True only if both are True. It returns False otherwise:
        - True and True = True
        - True and False = False
        - False and False = False
    - **or:** checks two booleans and returns True if at least one of them is True
        - True or True = True
        - True or False = True
        - False or False = False
    - **not:** negates the boolean value you use it on
        - not True = False
        - not False = True
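
The three operators can be tried on a single variable. This sketch uses x = 12, the value used in the cells of this section; the other variable names are added for illustration.

```python
x = 12

# and: True only when both comparisons are True
both = x > 5 and x < 15
print(both)     # prints True

# or: True when at least one comparison is True
either = x > 3 or x < 7
print(either)   # prints True

# not: flips the result of the comparison
negated = not x > 3
print(negated)  # prints False
```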

In [27]:
# Suppose you have a variable equal to 12

# Check if x is greater than 5


# You may check if x is less than 15


- You may check if both conditions are met at the same time with boolean operators

In [28]:
# Test with boolean operators


- Check to see if either condition is met

In [29]:
# Check to see if either x > 3 or x < 7


- Negate an operation

In [30]:
# negate the operation x>3 (Note that x = 12)


### if, elif, else Conditional Statements

- You can now utilize the boolean and comparison operators along with the conditional statements to make your Python code behave in any way you want.
- Conditional statements allow you to write code that performs tasks based on a condition
- Format
    - if condition:
    
            expression...must be indented with 4 spaces or a "tab"
- The expression is a function (action) that your code will perform on your data
- To exit the conditional statement continue writing your code without indentation
- **elif**: the elif statement allows you to add more conditions if the first one is not met
- **else**: the else statement allows you to run a separate function or perform a different action if the initial condition(s) are not met
- Format:
    - if condition:
         
          expression
             
      elif condition:
      
          expression
      
      else:
         
          expression
             
- Once a condition has been met, the remaining branches are skipped.
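
The format above could be applied to the house example like this. The thresholds (15 and 8) are hypothetical, since the lesson does not define where small, medium and large begin.

```python
# Define area
area = 10.0

# Only the first branch whose condition is True runs
if area > 15:      # hypothetical threshold for "large"
    size = 'large'
elif area > 8:     # hypothetical threshold for "medium"
    size = 'medium'
else:
    size = 'small'

print(size)  # prints medium
```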

***Example***: write code to determine the size category of a house (small, medium, large).

In [31]:
#Example

# Define Area
area = 10.0

# Use if, elif, else to determine the size category


In [68]:
# EXERCISE 5
# Create an if statement that checks if Felipe's favorite sport is tennis

felipe_fav_sport = 'soccer'


## Loops
- Loops are techniques that you can use to execute Python code several times over.


### While loop
- For **if** statements, Python checks the condition, then decides whether to execute the code or not
- For an **if** statement, Python only goes over the code once
- **while** executes the code if the condition is true. But as opposed to the **if** statement, the **while** loop will continue executing as long as the condition is true
- While loops are not used very often but they can be very useful
- "repeating an action until a particular condition is met"


***Example***: 

In [11]:
x = 1

while x < 6:
    print (x)
    x = x+1
    

1
2
3
4
5


- Notice how the while loop provided us with several answers, one for every loop, until **x** == 6
- With while loops you can write an infinite loop...
  
    - BE CAREFUL: The loop below continues indefinitely because the loop's condition is always met.

In [32]:
### infinite loop

# while y > 1:
#    y = y + 1
#    print(y)

### For loop
- The **for** loop allows you to iterate (repeat a process) over a group of items
- Format:

    for var in seq:
        expression
        
- It can be read as "for each variable in the sequence, execute the expression"
- "var" can be any variable you want. It is used in your expression to complete the iteration


Example: 

- Suppose that you want to print each number separately in the list s...
- you may do so this way:

In [71]:
s = [1,3,5,7,8]

print(s[0])
print(s[1])
print(s[2])
print(s[3])
print(s[4])

1
3
5
7
8


- But doing this would be tedious and time consuming
- In this case you may use a **for** loop
- A **for** loop allows you to visit item by item in your list:

In [72]:
for item in s:
    print(item)

1
3
5
7
8


- You may modify your code to perform multiple actions during each iteration
- Once you have looped through your data one time, the loop ends

In [33]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe (k would be the key, v would be the value)
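
One common way to loop over a dictionary is the items() method, which hands back each key together with its value. The sketch below uses a shortened copy of the europe dictionary; the printed sentence is a hypothetical choice.

```python
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin'}

# items() yields (key, value) pairs, unpacked here into k and v
for k, v in europe.items():
    print('the capital of ' + k + ' is ' + v)
```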


## LAB - for loops and Pandas

### Lab 1

Import *baseball_data.csv* as a DataFrame and complete the following:
- Iterate through the DataFrame to print the height column and add the word 'inches' after each value
- Iterate through the baseball DataFrame to count the number of players that are 73 inches

In [None]:
#loop over a data frame
#import pandas as pd
import pandas as pd

#read the baseball_data file

#iterate through the DataFrame, print the height of each player and add the word 'inches' after it

#create a counter for players that are 73in tall
# Define a 'counter' variable that is equal to zero
# Using a for loop, iterate over the height of each player, check the player's height and, if it
# is equal to 73 inches, add one to the counter
counter = 0


        
print('There are', counter, 'players that are 73 inches in height')

# Perform statistical analysis
# Using .mean(), .std() and .min() methods find the average, 
# standard deviation and smallest value of the heights of all players in our baseball dataset
        


### Lab 2

The dataset cars.csv is provided to you. This dataset contains 4 types of information: the country, the country's name abbreviation, the number of cars per 1000 people (cars_per_cap) and whether or not people in each country drive on the right side of the street.

- Import *cars.csv* dataset as a DataFrame 
- Iterate through the DataFrame and perform two print() calls: one for the row label and one to print out all the row contents
- Iterate through the DataFrame and print the names of the countries that have more than 500 cars / capita
- Add an additional column named COUNTRY that contains all the country names in upper case

In [2]:
# Import pandas


# Import cars data, create a dataframe called cars_df


# print cars_df to visualize the data structure


# Write a for loop that iterates over the rows of cars_df and, on each iteration,
# prints the row label and the row contents
# Using a for loop, print the name of every country listed

    
# Using a comparison operator, print the name of each country in which cars_per_cap is more than 500
# Complete this operation without using a for loop
# Note you might need to do research on how to use a boolean index to extract data and complete this task
