# Fundamentals of Information Systems

## Python Programming (for Data Science)

### Master's Degree in Data Science

#### Giorgio Maria Di Nunzio
#### (Courtesy of Gabriele Tolomei FIS 2018-2019)
<a href="mailto:giorgiomaria.dinunzio@unipd.it">giorgiomaria.dinunzio@unipd.it</a><br/>
University of Padua, Italy<br/>
2021/2022<br/>

# Lecture 8: I/O with <code>pandas</code>

## Overview

-  Accessing data is a necessary first step for any data scientist. 

-  We are going to see how to perform data input/output operations using <code>**pandas**</code>.

-  I/O might refer to: reading from/writing to text files (or other more efficient on-disk formats), accessing databases, interacting with network sources like web APIs, etc.

-  We will be exploring each of those separately (although we will be focusing more on text files).

## Loading Data into <code>DataFrame</code> Objects

-  There are many functions that allow <code>**pandas**</code> to read tabular data as a <code>**DataFrame**</code> object. 

-  Among those, <code>**read_csv**</code> and <code>**read_table**</code> are by far the ones you'll likely use the most.

## Optional Arguments to <code>read_*</code> Functions

-  **Indexing:** can treat one or more columns as the index of the returned <code>**DataFrame**</code>, and choose whether to get column names from the file, from the user, or not at all.

-  **Type inference and data conversion:** includes user-defined value conversions and custom lists of missing-value markers.

-  **Datetime parsing:** includes combining date and time information spread over multiple columns into a single column in the result.

-  **Iterating:** support for iterating over chunks of very large files.

-  **Unclean data issues:** skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
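
The option groups above can be sketched on a tiny in-memory file. The file content and column names below are made up for illustration; only <code>sep</code>, <code>comment</code>, <code>index_col</code> and <code>thousands</code> are actual <code>read_csv</code> options. The date and time columns are combined after loading with <code>pd.to_datetime</code>, which is one possible way to do it.

```python
import io
import pandas as pd

# Hypothetical in-memory "file": a comment line, a header, and three records.
raw = io.StringIO(
    "# exported by some tool\n"
    "id;date;time;amount\n"
    "1;2021-10-01;08:30;1,250\n"
    "2;2021-10-02;09:15;980\n"
    "3;2021-10-03;17:45;12,400\n"
)

df = pd.read_csv(
    raw,
    sep=';',          # field separator
    comment='#',      # unclean data: ignore lines starting with '#'
    index_col='id',   # indexing: use the 'id' column as row labels
    thousands=',',    # unclean data: "1,250" is parsed as the integer 1250
)

# Datetime handling: combine the separate date and time columns into one.
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'])

print(df)
print(df.dtypes)
```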

## Too Many Optional Arguments

-  Because of how messy data in the real world can be, some of the data loading functions (especially <code>**read_csv**</code>) have grown very complex over time. 

-  To avoid feeling overwhelmed by the huge number of possible options, please refer to the [online pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

-  Type inference is one of the more important features of these functions; that means you don't necessarily have to specify which columns are numeric, integer, boolean, or string.
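
As a small sketch of type inference, the made-up records below are loaded twice: once letting <code>pandas</code> guess every column type, and once forcing <code>zip_code</code> to stay a string through the <code>dtype</code> option. The data is hypothetical; it shows that inference is convenient but can silently drop leading zeros.

```python
import io
import pandas as pd

# Hypothetical pipe-separated content (not the lecture dataset).
text = (
    "user_id|age|score|active|zip_code\n"
    "1|24|3.5|True|85711\n"
    "2|53|4.0|False|94043\n"
    "3|23|2.5|True|02139\n"
)

# Let pandas infer every column type.
inferred = pd.read_csv(io.StringIO(text), sep='|')
print(inferred.dtypes)

# Override inference for one column: keep zip codes as strings.
typed = pd.read_csv(io.StringIO(text), sep='|', dtype={'zip_code': str})
print(typed['zip_code'].tolist())
```

With inference alone, the zip code <code>02139</code> becomes the integer 2139, which is why identifier-like columns are often read as strings.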

## <code>read_csv</code>/<code>read_table</code>

-  We will explore some of the most important I/O features provided by <code>**pandas**</code> using an example.

-  To this end, we use a tabular data file located on a remote server.

-  To see what such a file looks like, just click [here](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

-  Of course, you can save this file on your machine and load it locally from there with <code>**pandas**</code>.

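-  By default, <code>**read_csv**</code> assumes **comma-separated** fields (<code>**','**</code>), while <code>**read_table**</code> assumes **tab-separated** fields (<code>**'\t'**</code>); any other separator must be passed via <code>**sep**</code>.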

In [None]:
import pandas as pd
import numpy as np

In [None]:
"""
Let's start with a real example on how to load a tabular data file using pandas.
"""
# Locate the dataset (in this case, we use a remote file located on an external server)
# Alternatively, you can save this file on your machine and load it locally from there.

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'

# The first line of the file is the header, and fields
# are separated by a pipe ('|')
"""
We specify the url where the data is located, the character used to separate fields ('|')
and the name of the column to use as row labels (otherwise, a default RangeIndex will be used)
"""
users = pd.read_csv(url, sep = '|', index_col = 'user_id')
print(users.tail(10))

In [None]:
"""
Suppose we have stored the file on our local machine.
"""
path = './data/user_occupations.txt'

# read users
users = pd.read_csv(path, sep='|', index_col='user_id')
print(users.head(10))

In [None]:
"""
Suppose the file does not contain any header line. We can still load the file
telling pandas there is no header AND we can also provide pandas with a list 
of names corresponding to the header we want to use.
"""
# This is the path to the same file yet without the header line
path_no_header = './data/user_occupations_no_header.txt'

# The file does not contain a header line, so we pass header = None;
# otherwise pandas would use the first record as column names
users = pd.read_csv(path_no_header, sep = '|', header = None)

# Row and column labels fall back to the default integers (0, 1, 2, ...)
print(users.head(10))

In [None]:
# If the file does not contain the header as the first line AND we want to
# specify ourselves the names of the columns (and, possibly, the row index as well)
users = pd.read_csv(path_no_header, 
                    sep = '|', 
                    header = None, 
                    names = ['user_id', 'age', 'gender', 'occupation', 'zip_code'],
                    index_col = 'user_id'
                   )

print(users.head(10))

In [None]:
# Sometimes, it may be useful to skip some records of the input file.
# Here, skiprows = [0, 1, 5] skips the first, second and sixth lines (line numbers are 0-based).
users_skip = pd.read_csv(path_no_header, 
                         sep = '|', 
                         header = None,
                         names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
                         index_col = 'user_id',
                         skiprows = [0, 1, 5])

print(users_skip.head(10))

## Handling Missing Values (*NA* or *Not Available*)

-  Missing data is usually either not present (i.e., empty string) or marked by some **sentinel** value. 

-  By default, <code>**pandas**</code> recognizes a set of commonly occurring sentinels, such as <code>**NA**</code>, <code>**NULL**</code> and <code>**NaN**</code>.

-  The <code>**na_values**</code> option customizes the sentinel values: it takes a list or set of strings to treat as missing, in addition to the default ones.

-  Check the guide about [working with missing data](https://pandas-docs.github.io/pandas-docs-travis/user_guide/missing_data.html).

In [None]:
"""
Suppose we want to mark as NA any entry whose value is 'N/A'.
"""
# Load again the data with the option for handling missing values (na_values)
users = pd.read_csv(path, 
                    sep='|', 
                    index_col = 'user_id', 
                    na_values = ['N/A'])

# Alternatively, we can define a dictionary of sentinels, i.e., a set for each column.
sentinels = {'age': ['inf', 'N/A'], 'zip_code': ['00000']}

users = pd.read_csv(path, sep='|', index_col='user_id', na_values = sentinels)
print(users.head(20))
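
To see the per-column sentinels at work without the lecture file, here is a minimal sketch on made-up records. The four rows are hypothetical; the <code>sentinels</code> dictionary is the same one used above. Counting missing values per column with <code>isna().sum()</code> confirms which entries were marked.

```python
import io
import pandas as pd

# Hypothetical records containing the sentinels we want to catch.
text = (
    "user_id|age|occupation|zip_code\n"
    "1|24|technician|85711\n"
    "2|N/A|other|00000\n"
    "3|inf|writer|32067\n"
    "4|33||43537\n"
)

# One collection of sentinels per column.
sentinels = {'age': ['inf', 'N/A'], 'zip_code': ['00000']}

df = pd.read_csv(io.StringIO(text), sep='|', index_col='user_id',
                 na_values=sentinels)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The empty <code>occupation</code> field of user 4 is also counted as missing, because empty strings belong to the default sentinels.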

### Reading Text Files in Pieces

-  When processing very large files, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.

-  If we want to only read out a small number of rows (avoiding reading the entire file), specify that with <code>**nrows**</code>.

In [None]:
"""
Suppose we want to just read 100 records from our file.
"""
# Specify the number of rows to be read
users_100 = pd.read_csv(path, sep='|', index_col='user_id', nrows = 100)



# Verify that we actually read that many rows
print("Number of observations (#rows) = {}".format(users_100.shape[0]))
users_100.head(10)
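
Iterating through a file in chunks can be sketched with the <code>chunksize</code> option, which makes <code>read_csv</code> return an iterator of smaller <code>DataFrame</code> objects instead of a single one. The ten synthetic records below stand in for a very large file; the running totals are just one example of an aggregate that does not need the whole file in memory.

```python
import io
import pandas as pd

# Build a small synthetic pipe-separated "file" with 10 records (ages 21..30).
lines = ["user_id|age"] + ["{}|{}".format(i, 20 + i) for i in range(1, 11)]
buffer = io.StringIO("\n".join(lines))

total_rows = 0
age_sum = 0

# Each iteration yields a DataFrame with at most 4 rows.
for chunk in pd.read_csv(buffer, sep='|', index_col='user_id', chunksize=4):
    total_rows += chunk.shape[0]
    age_sum += chunk['age'].sum()

mean_age = age_sum / total_rows
print("Rows read = {}, mean age = {}".format(total_rows, mean_age))
```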

In [None]:
"""
Let's reload the full dataset (here, from the local copy of the file).
"""
users = pd.read_csv(path, sep='|', index_col='user_id')
print(users.head())
users.head()

In [None]:
"""
Let's print out some information about the data we just loaded.
"""
print("Number of observations (#rows) = {}".format(users.shape[0]))

print("Number of fields (#columns) = {}".format(users.shape[1]))

print("Column names = [{}]".format(", ".join([c for c in users.columns])))

print("The index (i.e., the labels) is:\n{}".format(users.index))

print("The data types of each column are:\n{}".format(users.dtypes))

In [None]:
"""
Suppose we want to access a single column of the DataFrame.
"""
# Let's return the first 5 values of the 'occupation' column.
print(users['occupation'][:5]) # alternatively, use users['occupation'].head()
print()

# The same can be obtained using '.' notation
print(users.occupation[:5]) # alternatively, use users.occupation.head()

In [None]:
"""
Suppose we want to access a single column of the DataFrame using .loc.
"""
# Let's return the 'occupation' values for the row labels up to 5.
# Label slices with .loc include the end label, so this yields user_id 1 to 5.
print(users.loc[:5, 'occupation'])

In [None]:
"""
Let's now create a deep copy of the loaded DataFrame 'users'.
Remember: assigning another name to the same DataFrame does not copy it;
it simply creates a second reference to the same object.
For example, users_df = users makes users_df point to the same DataFrame as users. As such,
any change made to the content of the DataFrame through users_df is reflected in users.
"""
# Make a deep copy of users
users_df = users.copy()
print(users_df.head())
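
The difference between a second name and a copy can be checked on a toy <code>DataFrame</code>. The data below is made up; <code>alias</code> and <code>independent</code> are illustrative names.

```python
import pandas as pd

original = pd.DataFrame({'age': [24, 53, 23]}, index=[1, 2, 3])

alias = original               # second name for the SAME object
independent = original.copy()  # separate object (deep copy by default)

# A change made through the alias is visible in the original...
alias.loc[1, 'age'] = 99

# ...while a change made to the copy is not.
independent.loc[2, 'age'] = 0

print(alias is original)        # same object
print(independent is original)  # different object
print(original)
```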

In [None]:
"""
Let's add an extra column to the DataFrame and populate this column
with some values (e.g., a series)
"""
# Suppose we want to add an extra column 'salary', which we randomly populate
# with integer values in the range [5000, 1000000), i.e., upper bound excluded
np.random.seed(142) # Initialize internal state of the random number generator

# set base salary
BASE_SALARY = 5000

# build values
values = pd.Series(np.random.randint(995000, size=users_df.shape[0]) + BASE_SALARY)
print(values.head())

In [None]:
"""
Before we actually "join" the Series we have just created with our users DataFrame,
we need the index of both objects to be aligned. Otherwise, there won't be any salary
associated with the DataFrame row index 943, as the Series index is shifted by 1 w.r.t.
the index of our DataFrame. Let's specify the index when creating our salary values.
"""
np.random.seed(42) # Initialize internal state of the random number generator
BASE_SALARY = 5000

# add index
values = pd.Series(np.random.randint(995000, size=users_df.shape[0]) + BASE_SALARY,
                  index=users_df.index)
print(values.head())
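
Why alignment matters can be seen on a three-row toy example (made-up data, illustrative column names). Assigning a <code>Series</code> to a column matches values by index label, not by position, so a <code>Series</code> with the default 0-based index leaves the last row without a value.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 53, 23]},
                  index=pd.Index([1, 2, 3], name='user_id'))

# Default index 0, 1, 2: does not match the DataFrame labels 1, 2, 3.
misaligned = pd.Series([1000, 2000, 3000])

# Same values, but carrying the DataFrame's own index.
aligned = pd.Series([1000, 2000, 3000], index=df.index)

df['salary_misaligned'] = misaligned  # label 3 has no match, so it gets NaN
df['salary_aligned'] = aligned

print(df)
```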

In [None]:
# Create a new column on the users_df DataFrame and populate this with
# the Series we just created
users_df['salary'] = values
print(users_df.head())

In [None]:
# We can access multiple columns of this new DataFrame as follows.
print("Occupation and Salary of the first 5 users:\n{}".
      format(users_df[['occupation', 'salary']].head()))

In [None]:
"""
Wait! We might not want to associate a salary to each entry!
For example, we don't want to assign a salary to any user younger than 18
or to anyone who doesn't have a job or is a student.
Let's see what the set of occupations is.
"""
users_df['occupation'].unique()

In [None]:
"""
Create a mask that selects only those users who are at least 18 AND
are neither students nor unemployed.
We then set the salary to 0 for every user NOT selected by the mask.
"""
mask = (users_df.age >= 18) & ~(users_df.occupation.isin(['student', 'none']))
# mask = (users_df.age >= 18) & (users_df.occupation != 'student') \
# & (users_df.occupation != 'none')
# mask = (users_df.age >= 18) & ~(users_df.occupation == 'student') \
# & ~(users_df.occupation == 'none')

In [None]:
mask

In [None]:
np.where(mask, users_df['salary'], 0)

In [None]:
#users_df['salary'] = users_df['salary'].where(mask, 0)
# Alternatively
users_df['salary'] = np.where(mask, users_df['salary'], 0)
users_df.head(10)
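
The two alternatives above give the same values but different types. The sketch below uses four made-up users to compare them: <code>np.where</code> returns a plain NumPy array, while <code>Series.where</code> returns a <code>Series</code> that keeps the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical users (not the lecture dataset).
df = pd.DataFrame({
    'age': [17, 30, 45, 22],
    'occupation': ['student', 'engineer', 'none', 'writer'],
    'salary': [10000, 20000, 30000, 40000],
})

mask = (df.age >= 18) & ~df.occupation.isin(['student', 'none'])

via_numpy = np.where(mask, df['salary'], 0)   # NumPy array, no index
via_pandas = df['salary'].where(mask, 0)      # Series, index preserved

print(via_numpy)
print(via_pandas)
```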

In [None]:
"""
Use integer slicing (special behavior to select rows)
"""
# Note that this integer slicing operator cannot be extended to both axes,
# as we did for 2-D numpy arrays. In other words, you cannot use the same
# syntax to slice over rows and columns with something like 
# users_df[i_start:i_stop, j_start:j_stop]
# In order to use integer slicing on BOTH axes as above, we need to use the .iloc indexer
print("First 7 rows of the DataFrame:\n{}".format(users_df[:7]))
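
As a minimal sketch of slicing on both axes, the toy <code>DataFrame</code> below (made-up data) is sliced first by rows only, then by rows and columns through <code>.iloc</code>.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'age': [24, 53, 23, 24],
        'gender': ['M', 'F', 'M', 'M'],
        'occupation': ['technician', 'other', 'writer', 'technician'],
    },
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)

# Plain integer slicing selects rows only.
rows = df[:2]

# .iloc accepts a slice per axis: first 2 rows, 2nd and 3rd columns.
block = df.iloc[:2, 1:3]

print(rows)
print(block)
```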

In [None]:
"""
Select all the users in the DataFrame whose salary is greater than 500k
"""
# This is a boolean mask which returns a Series containing either True or False
# corresponding to each entry index of the DataFrame depending on whether that entry
# has a salary which is greater than 500k or not.
mask = users_df.salary > 500000

print(mask.head(7))
print()
print("The list of first 5 users having salary greater than 500k is:\n{}"
      .format(users_df[mask].head()))

In [None]:
"""
Suppose I want to select only female users whose salary is greater than 500k
"""
mask = (users_df.salary > 500000) & (users_df.gender == 'F')
print(mask.head(7))
print()
print("The list of first 5 female users having salary greater than 500k is:\n{}"
      .format(users_df[mask].head()))

In [None]:
"""
Let's use the loc and iloc indexers to index both axes (i.e., rows and columns)
using either index/column labels (loc) or integer positions (iloc).
"""
# Note that in this special case, index (row) labels are integers...
# Even so, loc still treats them as labels, not positions: the label 1 (user_id = 1)
# sits at position 0, so loc[[1, 4]] and iloc[[0, 3]] select the same rows
print("user_id: 1 and 4 (ROWS); gender, salary, zip_code (COLUMNS):\n{}"
      .format(users_df.loc[[1, 4], ['gender', 'salary', 'zip_code']]))

print()

print("user_id: 1 and 4 (ROWS); 2nd, 5th, 4th (COLUMNS):\n{}"
      .format(users_df.iloc[[0, 3], [1, 4, 3]]))
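
With integer row labels it is easy to confuse the two indexers. The sketch below uses a made-up <code>DataFrame</code> whose integer labels are deliberately out of order, so that the same number selects different rows with <code>loc</code> (label) and <code>iloc</code> (position).

```python
import pandas as pd

# Integer labels in a shuffled order: label 1 is the LAST row.
df = pd.DataFrame({'age': [24, 53, 23]},
                  index=pd.Index([2, 3, 1], name='user_id'))

by_label = df.loc[1, 'age']    # row whose label is 1
by_position = df.iloc[1, 0]    # second row, first column

print("loc[1]  -> age {}".format(by_label))
print("iloc[1] -> age {}".format(by_position))
```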

In [None]:
"""
Suppose we want to sort the DataFrame by age (ascending) and salary (descending)
"""
print(users_df.sort_values(by=['age', 'salary'], ascending=[True, False]).tail())

In [None]:
"""
To make the above more meaningful, let's consider only the users whose salary is > 0
"""
print(users_df[users_df.salary > 0].sort_values(by=['age', 'salary'], 
                                                ascending=[True, False]).head(10))

In [None]:
"""
Suppose we want to compute the average salary of the users.
"""
# Let's first consider ALL the users (also those who have 0 salary)
print("The average salary across ALL the users is: {:.2f}"
      .format(users_df.salary.mean()))

# Let's now filter out from the mean computation any user whose salary is 0
print("The average salary across all working users is: {:.2f}"
      .format(users_df[users_df.salary > 0].salary.mean()))

In [None]:
"""
Let's see what the median age of the users in our DataFrame is.
"""
print("The median age across ALL the users is: {}"
      .format(users_df.age.median()))

In [None]:
"""
Let's see what happens if we call 'describe()' on this DataFrame
"""
print(users_df.describe()) # Notice, only numeric columns are part of the description!

In [None]:
# Let's try to include all the columns in the description
print(users_df.describe(include = "all"))

In [None]:
"""
Sometimes it is useful to know how the values of a particular attribute (i.e., column)
are distributed over the data instances that we have.
"""
# Let's first see how many unique occupations are in our dataset (already saw this above)
unique_occupations = users_df.occupation.unique()

print("There are {} unique occupation values, which are as follows:\n[{}]"
      .format(unique_occupations.shape[0], 
              ", ".join([o.title() for o in np.sort(unique_occupations)])))

In [None]:
"""
Now let's see how many times each unique value of the 'occupation' column
appears across the dataset. In other words, we compute the frequency count (a.k.a. histogram)
of the 'occupation' attribute.
"""
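# Note: the top-level pd.value_counts function is deprecated in recent pandas versions,
# so we call the value_counts method of the column instead.
print("Histogram of occupation values:\n{}"
      .format(users_df.occupation.value_counts(sort=True)))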
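
Frequency counts can also be expressed as relative frequencies. The sketch below uses six made-up occupation values and the <code>normalize</code> option of <code>value_counts</code>, which divides each count by the total number of values.

```python
import pandas as pd

# Hypothetical occupation values.
occupations = pd.Series(
    ['student', 'writer', 'student', 'none', 'student', 'writer']
)

counts = occupations.value_counts()                 # absolute counts, sorted
shares = occupations.value_counts(normalize=True)   # fractions summing to 1

print(counts)
print(shares)
```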

## Working with other text formats

-  Plain text files, such as <code>**.csv**</code> or <code>**.tsv**</code>, are not the only formats we might need to work with.

-  Other possible "text" formats can be: **JSON** (**J**ava**S**cript **O**bject **N**otation), **XML**/**HTML**, etc.
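
As a first taste of JSON, here is a minimal sketch that reads a list of records with <code>pd.read_json</code> and writes it back with <code>to_json</code>. The two records are made up, and <code>orient='records'</code> is just one of the layouts supported by both functions.

```python
import io
import json
import pandas as pd

# Hypothetical JSON document: a list of records (one object per user).
records = (
    '[{"user_id": 1, "age": 24, "occupation": "technician"},'
    ' {"user_id": 2, "age": 53, "occupation": "other"}]'
)

# Read the records into a DataFrame and use user_id as the row index.
df = pd.read_json(io.StringIO(records), orient='records').set_index('user_id')
print(df)

# Write the DataFrame back to a JSON string and parse it with the json module.
round_trip = json.loads(df.reset_index().to_json(orient='records'))
print(round_trip)
```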
