pandas can read data from many sources, including CSV, Excel, JSON, HTML, and SQL databases. First, we need to locate the file.

In [1]:
import os

os.getcwd() # Check the current working directory

'/Users/teddy/Documents/GitHub/Week6/McKinneyBook/Chapter678'

In [5]:
os.chdir('/Users/teddy/Documents/GitHub/Week6/McKinneyBook/Chapter5')
os.getcwd() # Check the updated working directory

'/Users/teddy/Documents/GitHub/Week6/McKinneyBook/Chapter5'

In [7]:
os.listdir('/Users/teddy/Documents/GitHub/Week6/McKinneyBook/Chapter5/titanic') # List all files in the titanic directory

['test.csv', 'train.csv', 'gender_submission.csv']

Data is commonly stored in CSV and TSV formats. Let's read a CSV file using pandas.

In [9]:
import pandas as pd

titanic_train = pd.read_csv('titanic/train.csv')

titanic_train.head() # Display the first few rows of the DataFrame

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


To read a TSV file, pass sep='\t' to pd.read_csv(), or use pd.read_table(), which treats tabs as the default separator.
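As a minimal sketch of reading tab-separated data, the cell below parses a small made-up sample held in memory rather than a file on disk. The column names and values are assumptions for illustration only and are not part of the Titanic data.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample held in memory instead of a file on disk
tsv_text = "name\tage\nAnn\t31\nBob\t45\n"

# sep="\t" tells read_csv that fields are separated by tabs
people = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(people.shape)
print(list(people.columns))
```

With a real file, the io.StringIO object would be replaced by the path to the .tsv file.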

To read Excel files, use pd.read_excel('file/path', sheet_name='nameOfSheet').

The internet gives you access to more data than you could ever hope to analyze.
The easiest way to use it is to download a CSV, TSV, or Excel file from a website and then read it into pandas.

Alternatively, we can copy a table from a website to our clipboard and use pd.read_clipboard() to read the data directly from the clipboard. Better yet, copy the data, paste it into Excel, and then use pd.read_excel() to read it.

pandas can also read HTML tables from websites. To do this, you need to know the URL of the page that contains the table you want. The read_html() function returns a list of DataFrames, one for each table it finds on the page. It does its best to parse tables from websites, but it may not always work perfectly.

All you need to write data to a simple format like CSV is the df.to_csv() function.
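A minimal sketch of writing CSV follows. The DataFrame here is a made-up example, not data from the course. When to_csv() is called without a path it returns the CSV text as a string, which makes it easy to inspect the result without creating a file.

```python
import io

import pandas as pd

# Hypothetical small DataFrame used only for illustration
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# With no path argument, to_csv returns the CSV text as a string;
# index=False leaves the row index out of the output
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives an equivalent DataFrame
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(df))
```

To write to disk instead, pass a file path as the first argument, for example df.to_csv('output.csv', index=False), where output.csv is a placeholder name. By default the row index is written as an extra first column, so index=False is often what you want.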

Not all data formats were covered in this course, but pandas handles a wide range of file formats. When it can't read a format directly, there is usually a library that can convert the data into a format pandas can handle.

# I'm going to put some notes about functions in here because they are important to know and my notes for this chapter are light.

In [3]:
def sum_3_items(x, y, z, print_args=False): # print_args is an optional keyword argument that defaults to False
    if print_args:  # If print_args is True, print the arguments
        print(x, y, z) # Print the arguments in the function
    return x + y + z # Return the sum of the arguments

sum_3_items(14, 22, 37) # Call the function with three arguments

73

In [4]:
sum_3_items(14, 22, 37, True) # Call the function with three arguments and print the arguments

14 22 37


73

In [1]:
def sum_many_args(*args): # *args makes this function accept any number of arguments
    print(type(args)) # args is a tuple
    return sum(args) # Return the sum of the arguments

sum_many_args(1, 2, 3, 4, 5) # Call the function with five arguments

<class 'tuple'>


15

In [5]:
def sum_keywords(**kwargs): # **kwargs makes this function accept any number of keyword arguments
    print(type(kwargs)) # kwargs is a dictionary
    return sum(kwargs.values()) # Return the sum of the values in the dictionary

sum_keywords(mynum=100, yournum=200) # Call the function with two keyword arguments

<class 'dict'>


300
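*args and **kwargs can also be combined in one function, and the same * and ** syntax works in the other direction to unpack a list or dictionary at the call site. The function name sum_everything below is hypothetical and exists only for this sketch.

```python
def sum_everything(*args, **kwargs):
    """Return the sum of all positional and keyword argument values."""
    # args is a tuple of positional values; kwargs maps names to values
    return sum(args) + sum(kwargs.values())

numbers = [1, 2, 3]
named = {"mynum": 100, "yournum": 200}

# * unpacks the list into positional arguments,
# ** unpacks the dictionary into keyword arguments
total = sum_everything(*numbers, **named)
print(total)
```

The call above is equivalent to sum_everything(1, 2, 3, mynum=100, yournum=200).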

Function Documentation

If you're writing a function that will be used in the future, it's a good idea to supply some documentation that explains how it works.

In [6]:
import numpy as np

def rmse(predicted, targets):
    """
    Computes the root mean squared error of two numpy ndarrays
    
    Args:
        predicted (numpy.ndarray): The predicted values
        targets (numpy.ndarray): The actual values
    
    Returns:
        float: The root mean squared error
    """
    return np.sqrt(np.mean((targets - predicted) ** 2))
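A docstring is more than a comment: Python stores it on the function, where help() and inspect.getdoc() can retrieve it. The sketch below uses a hypothetical pure-Python variant, rmse_plain, that works on lists so it runs without NumPy. It is an illustration, not a replacement for the function above.

```python
import inspect
import math

def rmse_plain(predicted, targets):
    """
    Computes the root mean squared error of two equal-length sequences

    Args:
        predicted (list of float): The predicted values
        targets (list of float): The actual values

    Returns:
        float: The root mean squared error
    """
    squared_errors = [(t - p) ** 2 for p, t in zip(predicted, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# inspect.getdoc returns the docstring with its indentation cleaned up;
# help(rmse_plain) displays the same text interactively
doc = inspect.getdoc(rmse_plain)
print(doc.splitlines()[0])

print(rmse_plain([0.0, 0.0], [3.0, 3.0]))
```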

Lambda functions

Python provides a way to define small anonymous functions using the lambda keyword. These are useful when you want to use a function only once or when you need a quick function to use without defining it first.

In [None]:
lambda x, y: x + y # This is a lambda function that adds two numbers

In [9]:
my_function2 = lambda x, y: x + y # Assign the lambda function to a variable and give it a name

my_function2(3, 4) # Call the function with two arguments
# Naming a lambda works, but it is not the point of a lambda

7

Lambdas can have names, but they truly shine when they are used as arguments to other functions.

In [10]:
#Example using map() without a lambda function

def square(x):
    return x**2

my_map = map(square, [1, 2, 3, 4, 5]) # Apply the square function to each element in the list

for item in my_map:
    print(item)

1
4
9
16
25


In [11]:
#Example using map() with a lambda function

my_map = map(lambda x: x**2, [1, 2, 3, 4, 5]) # Apply the lambda function to each element in the list

for item in my_map:
    print(item)

1
4
9
16
25


The lambda version is shorter than defining a separate named function just to pass it to map().
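The same pattern applies to other built-ins that take a function as an argument. As a small sketch, the cell below passes lambdas to sorted() and filter(). The variable names are chosen for illustration.

```python
words = ["study", "is", "life"]

# sorted() calls the key function on each item and sorts by the result
by_length = sorted(words, key=lambda word: len(word))
print(by_length)

# filter() keeps the items for which the function returns a truthy value
odd_numbers = list(filter(lambda x: x % 2 != 0, range(10)))
print(odd_numbers)
```

Like map(), filter() returns a lazy iterator, which is why the result is wrapped in list() before printing.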

I'm now going to review list comprehensions.

In [13]:
my_list = []
for i in range(10):
    my_list.append(i**2)
print(my_list)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [14]:
#Example using list comprehension

print([num ** 2 for num in range(10)])

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


The list comprehension is expressed in one line as opposed to the for loop.

We can also add logical checks to our comprehensions.

In [15]:
print([num ** 2 for num in range(10) if num % 2 != 0])

[1, 9, 25, 49, 81]


We can put more than one for loop in a list comprehension, such as to construct a list from two different iterables.

In [16]:
print([a+b for a in "life" for b in "study"]) # Combine two strings

['ls', 'lt', 'lu', 'ld', 'ly', 'is', 'it', 'iu', 'id', 'iy', 'fs', 'ft', 'fu', 'fd', 'fy', 'es', 'et', 'eu', 'ed', 'ey']
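The order of the for clauses in a comprehension matches the order of nested for loops: the first clause is the outer loop and the second is the inner loop. A sketch of the equivalent loops, using the hypothetical name combined_pairs:

```python
combined_pairs = []
for a in "life":         # outer loop: the first for clause
    for b in "study":    # inner loop: the second for clause
        combined_pairs.append(a + b)

# The nested loops build the same list as the comprehension
print(combined_pairs == [a + b for a in "life" for b in "study"])
print(combined_pairs[:5])
```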


We can also put one list comprehension inside another list comprehension.

In [17]:
print([letters[1] for letters in [a + b for a in 'life' for b in 'study']]) # Get the second character from each string

['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y']


Too many nested structures on a single line can make the code hard to read.
Instead, it's recommended to put each nested structure on its own line for easier readability.

In [18]:
combined = [a + b for a in 'life' for b in 'study'] # Combine two strings
print([letters[1] for letters in combined]) # Get the second character from each string

['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y']


Now let's review dictionary comprehensions.

In [19]:
words = ['life','is','study']

word_length_dict = {}

for word in words:
    word_length_dict[word] = len(word)

print(word_length_dict)

{'life': 4, 'is': 2, 'study': 5}


In [24]:
print({word: len(word) for word in words})

{'life': 4, 'is': 2, 'study': 5}


It's common to create a dictionary from two different ordered sequences.

In [25]:
words = ['life','is','study'] 
word_lengths = [4, 2, 5]
pairs = zip(words, word_lengths) # Pair up the two sequences; zip returns an iterator of tuples, not a list

for item in pairs: # Print each tuple
    print(item)

('life', 4)
('is', 2)
('study', 5)


Let's use zip inside a dictionary comprehension.

In [26]:
words = ['life','is','study'] 
word_lengths = [4, 2, 5]
print({key: value for key, value in zip(words, word_lengths)})

{'life': 4, 'is': 2, 'study': 5}
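Two details about zip are worth a quick sketch. A zip object is an iterator, so it can only be consumed once. And when no transformation of the keys or values is needed, dict() can build the dictionary from the pairs directly. The names first_pass and second_pass are for illustration only.

```python
words = ['life', 'is', 'study']
word_lengths = [4, 2, 5]

pairs = zip(words, word_lengths)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left the second time

print(first_pass)
print(second_pass)

# dict() accepts an iterable of (key, value) pairs directly
print(dict(zip(words, word_lengths)))
```

The comprehension form is still the better fit when the keys or values need to be filtered or changed while building the dictionary.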


For completeness, here is a set comprehension. It uses curly braces like a dictionary comprehension but has no key: value pairs.

In [27]:
print({num for num in range(7) if num % 2 == 0})

{0, 2, 4, 6}
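One practical difference from a list comprehension is that a set keeps each value only once. A small sketch with a made-up word list:

```python
words = ["life", "is", "study", "is", "life"]

# Duplicate lengths collapse because a set holds each value only once
unique_lengths = {len(word) for word in words}

# Sets are unordered, so compare against a set rather than relying on print order
print(unique_lengths == {2, 4, 5})
print(len(unique_lengths))
```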
