# Working with tabular data

## Introducing `numpy`

In the last module you saw some of the limitations of doing quick quantitative analysis with built-in Python functionality:

In [1]:
# It's hard to calculate on lists!
my_list = [4,1,5,2]

my_list * 2

[4, 1, 5, 2, 4, 1, 5, 2]

Fortunately you also learned about packages -- they'll come to our rescue!

Let's store these same numbers in what's called a `numpy` *array*. 

This involves importing the `numpy` package. 

`numpy` is short for "numerical Python."

Generally, when we use a package for the first time, we need to install it:

In [1]:
# Install numpy
#!pip install numpy



However, `numpy` was installed already when we installed `pandas`.

We *do* still need to import `numpy` before using it: 

In [9]:
import numpy

We can use `numpy.array()` to create an array.

In [None]:
my_array = numpy.array([4,1,5,2])
print(my_array)

[4 1 5 2]

We can also convert our list to an array:

In [None]:
my_list_to_array = numpy.array(my_list)
print(my_list_to_array)

[4 1 5 2]

`numpy` arrays work in many ways like ranges of a spreadsheet...

In [10]:
# Isn't this what you were expecting earlier?

print(my_array * 2)
print(my_list_to_array * 2)

[ 8  2 10  4]
[ 8  2 10  4]
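To make the contrast with lists concrete, here is a small sketch of a few more element-wise calculations. The names `prices` and `quantities`, and their values, are made up for illustration.

```python
import numpy

# Hypothetical example data
prices = numpy.array([4, 1, 5, 2])
quantities = numpy.array([10, 20, 30, 40])

# Arithmetic is applied element by element,
# like a spreadsheet formula filled down a column
totals = prices * quantities
print(totals)

# With plain lists, the same calculation needs an explicit loop or comprehension
list_totals = [p * q for p, q in zip([4, 1, 5, 2], [10, 20, 30, 40])]
print(list_totals)

# Summary calculations work on the whole array at once
print(totals.sum())
```

Both approaches give the same numbers; the array version just reads more like the spreadsheet formula you already know.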


Lists and arrays may *look* the same to you, but they are different data types to Python:

In [14]:
my_list = [4,1,5,2]
my_array = numpy.array([4,1,5,2])

print(type(my_list))
print(type(my_array))

<class 'list'>
<class 'numpy.ndarray'>


Based on what we're seeing, we may want to call on `numpy` *quite* often. 

Let's look at a cool "hack" for doing so...

### Aliasing modules

Remember that each time we use a function or method associated with `numpy`, we need to tell Python where to look for it: 

In [11]:
# Create another array...
my_other_array = numpy.array([4,16,25,100])

# numpy has a square root function of its own...
numpy.sqrt(my_other_array)

array([ 2.,  4.,  5., 10.])

I am already getting sick of typing `numpy` each time I want to use something from it! Can't we make this easier?

Yes. Yes, we can.

Turns out we can temporarily rename, or *alias*, the `numpy` module when we import it. We will use the format:

```
import [name of module] as [alias]
```

`np` is a popular alias for `numpy`. Rather than typing `numpy` each time you use a function from that library, you can simply type `np`. 



In [2]:
import numpy as np

# Create another array...
my_other_array = np.array([4,16,25,100])

# numpy has a square root function of its own...
np.sqrt(my_other_array)

array([ 2.,  4.,  5., 10.])
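If you are wondering whether the alias creates a second copy of the package, it does not. As a quick sketch, importing under both names shows that they point at the same module:

```python
import numpy
import numpy as np

# Both names refer to the very same module object
print(np is numpy)

# So a call through either name gives the same result
print(np.sqrt(16) == numpy.sqrt(16))
```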

### Drill

Take a shot at assigning an array and finding its square root using this aliasing method. 

In [None]:
# Import and alias the module

import ___ ___ ___

# Create an array
my_new_array = ___.___([36, 49, 64, 81])

# Take its square root
np.___(___)

Aliasing saved you some keystrokes, huh?

![Life hackz](life-hackz.gif)

## Accessing and reshaping arrays

Python indexes *everything* at zero, not just lists. This includes `numpy` arrays!

In [15]:
my_array = np.array([4,1,5,2])

# Access first element of the array
print(my_array[1])

# Oh sorry... NOW I'm accessing the first element! 🤦‍♂️
print(my_array[0])

1
4


You've already sweated through zero-based indexing, so let's move on... to two-dimensional arrays. 

(You will see that you'll never truly escape zero-based indexing in Python, however... 😼)

## Two-dimensional arrays in `numpy`

So far, we have been working on one-dimensional sets of data. But what if we wanted to mix that up? 

![Illustration of numpy arrays](numpy-arrays.png)



*Source: Nunez-Iglesias, Juan, Stéfan Van Der Walt, and Harriet Dashnow.* Elegant SciPy: The Art of Scientific Python. *O'Reilly Media, 2017.*


`numpy` can create arrays with three or more dimensions, but let's focus on two: this is a familiar way to shape data, as it's how data is often stored in spreadsheets (as rows and columns).


We can create a two-dimensional array in `numpy` with the `array()` function. This time we will pass in a list of lists, with one inner list per row.

In [3]:
# Create a two-dimensional array with `np.array()`

my_2d_array = np.array([[3,4,1],[2,5,0]])
print(my_2d_array)

[[3 4 1]
 [2 5 0]]


We can also reshape an existing one-dimensional array into a two-dimensional array using `np.reshape()`.

In [2]:
# New array
my_array = np.array([1,2,3,4,5,6])
print(my_array)

# Let's make a 2 x 3 array
my_reshaped_array = np.reshape(my_array, (2, 3))
print(my_reshaped_array)

[1 2 3 4 5 6]
[[1 2 3]
 [4 5 6]]


A two-dimensional array is starting to look like the kind of dataset that you might actually work with as a spreadsheet user, with rows and columns.
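Two details of `np.reshape()` are worth a quick sketch: passing `-1` for one dimension lets `numpy` work that dimension out for you, and the new shape has to account for every element. The variable name `three_rows` is just for illustration.

```python
import numpy as np

my_array = np.array([1, 2, 3, 4, 5, 6])

# -1 asks numpy to work out that dimension from the array's size
three_rows = np.reshape(my_array, (3, -1))
print(three_rows.shape)

# The new shape must use every element:
# 6 values cannot fill a 4 x 2 grid, so this raises an error
try:
    np.reshape(my_array, (4, 2))
except ValueError as error:
    print("Reshape failed:", error)
```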

## Inspecting our arrays

Python objects, such as our arrays, carry *attributes*, which we can look up using the format 

`variable.[attribute]`


Some attributes we can use to learn more about our `numpy` arrays are:

`shape`: gives us the dimensions of the array.  
`size`: gives us the number of elements of the array.   
`dtype`: gives us the data type of the elements of the array.   

In [8]:
print(my_reshaped_array.shape)
print(my_reshaped_array.size)
print(my_reshaped_array.dtype)

(2, 3)
6
int32
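As a small sketch, here are the same attributes side by side for a one-dimensional and a two-dimensional array, plus `ndim`, which gives the number of dimensions. The arrays `one_d` and `two_d` are made up for illustration. The integer `dtype` you see may be `int32` or `int64` depending on your platform and `numpy` version.

```python
import numpy as np

# Hypothetical example arrays
one_d = np.array([1.5, 2.5, 3.5])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

for name, arr in [("one_d", one_d), ("two_d", two_d)]:
    # ndim is the number of dimensions, i.e. the length of the shape tuple
    print(name, arr.shape, arr.ndim, arr.size, arr.dtype)
```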


### Indexing and slicing our arrays

Remember when I said that zero-based indexing never really goes away? I wasn't kidding. 

Now we have to index on *two* counts: the row and the column. Our indexing of two-dimensional `numpy` arrays will look like this:

`np_array[row_number, column_number]`

Some examples:



In [4]:
print(my_reshaped_array)

# Get the value in first row, first column
# Never forget zero-based indexing!
my_reshaped_array[0,0]


[[1 2 3]
 [4 5 6]]


1

In [5]:
# What about the second-last row/second-last column?
my_reshaped_array[-2,-2]

2

In [7]:
# What about up through the second row and second column?
my_reshaped_array[:2,:2]

array([[1, 2],
       [4, 5]])

Indexing will be important when accessing and slicing data, and in Python it will **always be zero-based**!
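A common need is to pull out a whole row or a whole column. A minimal sketch, rebuilding the same 2 x 3 array so the example runs on its own:

```python
import numpy as np

my_reshaped_array = np.reshape(np.array([1, 2, 3, 4, 5, 6]), (2, 3))

# A bare colon means "everything along this dimension"
first_row = my_reshaped_array[0, :]
second_column = my_reshaped_array[:, 1]
print(first_row)
print(second_column)

# Slices stop *before* the end position:
# positions 1 and 2 are included, position 3 is not
last_two_columns = my_reshaped_array[:, 1:3]
print(last_two_columns)
```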

# DRILL

Practice your `numpy` skills by operating on a large array. 

I will get you started; complete the operations based on what the comments are asking for. 

In [18]:
# Don't worry about this part -- I am reading the file into Python.
# You will learn how to read files into Python in the next unit. 
my_array = np.genfromtxt('numpy-drill.csv')
print(my_array)

[47. 21. 23. 24. 45.  6. 30. 43. 45. 23.  2. 46.  4. 34. 42.  2. 47. 14.
 18.  9. 50. 34. 12. 24. 42. 24.  3. 39. 17. 15. 37. 18. 46. 25.  9. 41.
 45. 34. 22. 26. 27. 44. 28.  4. 15. 31.  3. 39. 15. 23.  5. 27. 11. 25.
 16. 11.  2. 43. 35. 45. 27. 48. 44. 20.  4. 21.  8. 48. 29. 20. 15. 20.
 37. 17.  6. 13. 39. 25.  5. 11.  4. 20. 47.  9.  2.  8. 44. 40.  8.  1.
 45. 26. 43. 10. 22. 24.  3. 48. 29. 49.]


In [23]:
# What is the shape of this array?
# This also tells us how many dimensions there are --
# one number means one dimension
my_array.___

(100,)

In [None]:
# What is its datatype?
___

In [20]:
# Reshape this result into a 10x10 array
my_array = np.reshape(___, ___)


In [None]:
# What is the shape of our array now?
___

In [None]:
# Take the sqrt of this array
my_array = np.___(___)
my_array

In [21]:
# Access the element in the fourth row
# and second column of the array
my_array___

array([6.8556546 , 4.58257569, 4.79583152, 4.89897949, 6.70820393,
       2.44948974, 5.47722558, 6.55743852, 6.70820393, 4.79583152,
       1.41421356, 6.78232998, 2.        , 5.83095189, 6.4807407 ,
       1.41421356, 6.8556546 , 3.74165739, 4.24264069, 3.        ,
       7.07106781, 5.83095189, 3.46410162, 4.89897949, 6.4807407 ,
       4.89897949, 1.73205081, 6.244998  , 4.12310563, 3.87298335,
       6.08276253, 4.24264069, 6.78232998, 5.        , 3.        ,
       6.40312424, 6.70820393, 5.83095189, 4.69041576, 5.09901951,
       5.19615242, 6.63324958, 5.29150262, 2.        , 3.87298335,
       5.56776436, 1.73205081, 6.244998  , 3.87298335, 4.79583152,
       2.23606798, 5.19615242, 3.31662479, 5.        , 4.        ,
       3.31662479, 1.41421356, 6.55743852, 5.91607978, 6.70820393,
       5.19615242, 6.92820323, 6.63324958, 4.47213595, 2.        ,
       4.58257569, 2.82842712, 6.92820323, 5.38516481, 4.47213595,
       3.87298335, 4.47213595, 6.08276253, 4.12310563, 2.44948

# Working with `pandas`

When you think of "tabular data" in Python, think of `pandas`. 

This package is built on top of `numpy`, but adds some extra functionality for us. 


### `pandas` Series

These are one-dimensional data structures in `pandas`. 

We won't spend too much time analyzing these, but it's important to know that `pandas` represents one-dimensional data, such as a single column of a DataFrame, as a Series. 


![pandas Series](images/series.png)

### DataFrames

We will focus on the `pandas` DataFrame, which is a two-dimensional, tabular data structure with labeled rows and columns. Below is an example:

![`pandas` DataFrame example](images/dataframe.png)


*Look familiar?* This is very much the way we often store data in a spreadsheet.

One key difference between `numpy` arrays and `pandas` DataFrames is that the columns of DataFrames can be of different data types:

![DataFrame column data types](images/datatypes.png)

This is a *lot* like a spreadsheet!

## Importing `pandas`

Same as with `numpy`, we need to import `pandas` in each session before we can use it.

Similarly to `numpy`, it is common to *alias* `pandas` when we import it. This alias usually takes the form:

`import pandas as pd`

Go ahead and try it yourself in the cell below!

In [9]:
# Load pandas into our session
import pandas as pd

### Creating DataFrames

There are several ways to create a DataFrame. We could, for example, convert a `numpy` array into one using `pd.DataFrame()`.

In [10]:
# Create our array
numpy_data = np.array([[1,2,3], [4,5,6]])

# Convert into a DataFrame
df = pd.DataFrame(data=numpy_data)

df

|     | 0   | 1   | 2   |
| --- | --- | --- | --- |
| 0   | 1   | 2   | 3   |
| 1   | 4   | 5   | 6   |


By default, our DataFrame includes column *names* and row *index labels*... **starting at zero**!

![labelled image of DataFrames](images/dataframe-labelled.png)

It's common to keep the index labels as numeric, but to name the columns. Let's do it:

In [11]:
# Create our array
numpy_data = np.array([[1,2,3], [4,5,6]])

# Convert into a DataFrame,
# name the columns
df = pd.DataFrame(data=numpy_data, columns=['Column A','Column B','Column C'])

df

|     | Column A | Column B | Column C |
| --- | -------- | -------- | -------- |
| 0   | 1        | 2        | 3        |
| 1   | 4        | 5        | 6        |
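Another of the several ways to create a DataFrame is from a dictionary, where each key becomes a column name. The sketch below uses a handful of illustrative values, and also shows two points made earlier: columns can hold different data types, and a single column is a Series.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "name": ["Alabama", "Wyoming"],
    "Year": [2010, 2010],
    "Density": [50750.0, 97105.0],
})

# Each column keeps its own data type
print(df.dtypes)

# Selecting one column returns a one-dimensional Series
print(type(df["Year"]))
```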


## Reading data into `pandas`

While it's possible to create a DataFrame from an existing data structure (like a `numpy` array), you'll more likely do so by importing data from an outside source. 

`pandas` can create DataFrames from practically any data format, including SQL databases and HTML. But let's focus on `csv` files and Excel workbooks.

### Reading from `csv` files

We can import a `csv` file as a DataFrame with `read_csv()`.

There are a *lot* of [optional arguments to provide `read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but we *must* provide a file path.


#### Interlude: file paths and directories

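It can sometimes be tricky to work out *where* a file is located so that you can read it into Python.

By default, a relative file path is resolved against your *working directory*, which is usually the folder that contains your notebook or script. If you aren't sure where that is, you can check it with `os.getcwd()`, after importing Python's built-in `os` module:

In [15]:
# The os module is part of the standard library, but still needs importing
import os

# What folder am I operating in on my computer?
os.getcwd()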

'c:\\Users\\GeorgeM\\Documents\\GitHub\\olt-first-steps-with-python-for-spreadsheet-users\\2-working-with-tabular-data'
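If a file refuses to load, it can help to see exactly which full path a relative path resolves to. Here is a minimal sketch using the standard library's `pathlib`; the relative path `data/state-populations.csv` is used purely as an example, and the check will print `False` if no such file exists on your machine.

```python
import os
from pathlib import Path

# Relative paths are resolved against the current working directory
working_dir = Path(os.getcwd())

# Example relative path: a file inside a "data" subfolder
relative_path = Path("data") / "state-populations.csv"
full_path = working_dir / relative_path

print(full_path)

# Check whether the file is really there before trying to read it
print(full_path.exists())
```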

In this case, the `state-populations.csv` file is located in the `data` folder of this directory, so we can find it here:

In [12]:
# Read in the state-populations file that exists in the data folder
pd.read_csv('data/state-populations.csv')

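|     | name    | Year | Population |
| --- | ------- | ---- | ---------- |
| 0   | Alabama | 2010 | 4785492    |
| 1   | Alabama | 2011 | 4799918    |
| 2   | Alabama | 2012 | 4815960    |
| 3   | Alabama | 2013 | 4829479    |
| 4   | Alabama | 2014 | 4843214    |
| ... | ...     | ...  | ...        |
| 352 | Wyoming | 2012 | 576765     |
| 353 | Wyoming | 2013 | 582684     |
| 354 | Wyoming | 2014 | 583642     |
| 355 | Wyoming | 2015 | 586555     |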


It's nice that we were able to read this into Python... but to do much of anything with it, we'll need to assign it to a variable:

In [26]:
# Read in csv as DataFrame, assign to variable
state_pop = pd.read_csv('data/state-populations.csv')

# Now I can refer to the variable and operate on it...
# This one isn't real helpful 😼
state_pop * 2

|     | name           | Year | Population |
| --- | -------------- | ---- | ---------- |
| 0   | AlabamaAlabama | 4020 | 9570984    |
| 1   | AlabamaAlabama | 4022 | 9599836    |
| 2   | AlabamaAlabama | 4024 | 9631920    |
| 3   | AlabamaAlabama | 4026 | 9658958    |
| 4   | AlabamaAlabama | 4028 | 9686428    |
| ... | ...            | ...  | ...        |
| 352 | WyomingWyoming | 4024 | 1153530    |
| 353 | WyomingWyoming | 4026 | 1165368    |
| 354 | WyomingWyoming | 4028 | 1167284    |
| 355 | WyomingWyoming | 4030 | 1173110    |
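To give a feel for the optional arguments to `read_csv()`, here is a hedged sketch using two of them, `usecols` and `nrows`. So that it runs without any file on disk, it reads from an in-memory string (`csv_text`, a stand-in with a few rows in the same layout as the state populations file) instead of a file path.

```python
import io
import pandas as pd

# Stand-in for a file on disk: a few rows in the same layout as the csv above
csv_text = """name,Year,Population
Alabama,2010,4785492
Alabama,2011,4799918
Alabama,2012,4815960
"""

# usecols keeps only the listed columns;
# nrows limits how many data rows are read
subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "Population"], nrows=2)
print(subset)
```

With a real file you would pass the file path in place of `io.StringIO(csv_text)`.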


### Reading from Excel

We'll now read in a file `state-populations.xlsx`, also located in the `data` folder.


Reading Excel workbooks into DataFrames works much like reading `csv` files. 

This time, we'll use `read_excel()`. 

Once again, there are [lots of optional arguments](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html), but we must provide a file path: 

In [3]:
# Read in our data
state_pop = pd.read_excel('data/state-populations.xlsx')
state_pop

|     | name    | Year | Population | Density |
| --- | ------- | ---- | ---------- | ------- |
| 0   | Alabama | 2010 | 4785492    | 50750.0 |
| 1   | Alabama | 2011 | 4799918    | 50750.0 |
| 2   | Alabama | 2012 | 4815960    | 50750.0 |
| 3   | Alabama | 2013 | 4829479    | 50750.0 |
| 4   | Alabama | 2014 | 4843214    | 50750.0 |
| ... | ...     | ...  | ...        | ...     |
| 352 | Wyoming | 2012 | 576765     | 97105.0 |
| 353 | Wyoming | 2013 | 582684     | 97105.0 |
| 354 | Wyoming | 2014 | 583642     | 97105.0 |
| 355 | Wyoming | 2015 | 586555     | 97105.0 |


If our workbook contains multiple worksheets and we want to read specific one(s), we would specify the `sheet_name` argument:

In [4]:
# Read the `populations` worksheet
state_pop = pd.read_excel('data/state-populations.xlsx', sheet_name='populations')
print(state_pop)

# There is also a `readme` worksheet
readme = pd.read_excel('data/state-populations.xlsx',sheet_name='readme')
print(readme)

        name  Year  Population  Density
0    Alabama  2010     4785492  50750.0
1    Alabama  2011     4799918  50750.0
2    Alabama  2012     4815960  50750.0
3    Alabama  2013     4829479  50750.0
4    Alabama  2014     4843214  50750.0
..       ...   ...         ...      ...
352  Wyoming  2012      576765  97105.0
353  Wyoming  2013      582684  97105.0
354  Wyoming  2014      583642  97105.0
355  Wyoming  2015      586555  97105.0
356  Wyoming  2016      585501  97105.0

[357 rows x 4 columns]
  Data sources:                                       Unnamed: 1
0     US Census  https://www.census.gov/prod/cen2010/cph-2-1.pdf


That second worksheet doesn't look much like a table of data... but `pandas` did its best to make it so! 

This is a good reminder that DataFrames will always be two-dimensional structures where all the rows in a given column are of the same data type. 

That's it for reading data from Excel. 

If you are interested in using Python to automate the creation of Excel workbooks, check out my OLT session on "Python-Powered Excel."

## Reading from Google Sheets

For many users, spreadsheets mean Google Sheets. 

It is possible to read DataFrames from Google Sheets, but it requires using Google's API. Due to that added setup, we will skip it in this workshop.

[For instructions on reading from Google Sheets, check out this blog post.](https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2)


## Exploring our DataFrame

*Success*! Our data has been read in and assigned to a variable. Now let's get to know our data. 

We can of course get a start with `print()`:


In [5]:
print(state_pop)

        name  Year  Population  Density
0    Alabama  2010     4785492  50750.0
1    Alabama  2011     4799918  50750.0
2    Alabama  2012     4815960  50750.0
3    Alabama  2013     4829479  50750.0
4    Alabama  2014     4843214  50750.0
..       ...   ...         ...      ...
352  Wyoming  2012      576765  97105.0
353  Wyoming  2013      582684  97105.0
354  Wyoming  2014      583642  97105.0
355  Wyoming  2015      586555  97105.0
356  Wyoming  2016      585501  97105.0

[357 rows x 4 columns]


However, this is a *lot* of rows to look through. There are so many that `pandas` doesn't even print all of them! You could [change the options](https://dev.to/chanduthedev/how-to-display-all-rows-from-data-frame-using-pandas-dha) to print all rows, but that's really not an effective way to size up our DataFrame. 

Instead, let's use the methods and attributes below to explore more efficiently:

 

| Method/attribute | Returns                                                         |
| ---------------- | --------------------------------------------------------------- |
| `df.info()`      | Column names with their data type and number of complete values |
| `df.columns`     | Column names                                                    |
| `df.dtypes`      | Data types                                                      |
| `df.shape`       | Dimensions (# rows by # columns)                                |
| `df.head()`      | First 5 rows                                                    |
| `df.tail()`      | Last 5 rows                                                     |
| `df.describe()`  | Descriptive statistics                                          |


`df` is a common stand-in for a generic DataFrame which you'll often see in examples. 

In [6]:
# Return column names, data types and number of complete values
state_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 4 columns):
name          357 non-null object
Year          357 non-null int64
Population    357 non-null int64
Density       350 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 11.3+ KB


In [7]:
# Return column names
state_pop.columns

Index(['name', 'Year', 'Population', 'Density'], dtype='object')

In [8]:
# Return data types
state_pop.dtypes

name           object
Year            int64
Population      int64
Density       float64
dtype: object

In [9]:
# Return shape
state_pop.shape

(357, 4)

In [10]:
# Get first 5 rows
state_pop.head()

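|     | name    | Year | Population | Density |
| --- | ------- | ---- | ---------- | ------- |
| 0   | Alabama | 2010 | 4785492    | 50750.0 |
| 1   | Alabama | 2011 | 4799918    | 50750.0 |
| 2   | Alabama | 2012 | 4815960    | 50750.0 |
| 3   | Alabama | 2013 | 4829479    | 50750.0 |
| 4   | Alabama | 2014 | 4843214    | 50750.0 |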


In [11]:
# Get last 5 rows
state_pop.tail()

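|     | name    | Year | Population | Density |
| --- | ------- | ---- | ---------- | ------- |
| 352 | Wyoming | 2012 | 576765     | 97105.0 |
| 353 | Wyoming | 2013 | 582684     | 97105.0 |
| 354 | Wyoming | 2014 | 583642     | 97105.0 |
| 355 | Wyoming | 2015 | 586555     | 97105.0 |
| 356 | Wyoming | 2016 | 585501     | 97105.0 |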


In [12]:
# Get descriptive statistics
state_pop.describe()

|       | Year     | Population | Density      |
| ----- | -------- | ---------- | ------------ |
| count | 357.0    | 357.0      | 350.0        |
| mean  | 2013.0   | 6201127.0  | 70725.4      |
| std   | 2.002807 | 6984741.0  | 85096.140584 |
| min   | 2010.0   | 564513.0   | 1034.0       |
| 25%   | 2011.0   | 1652828.0  | 35870.0      |
| 50%   | 2013.0   | 4400477.0  | 54155.5      |
| 75%   | 2015.0   | 6895226.0  | 81823.0      |
| max   | 2016.0   | 39250020.0 | 570641.0     |
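Notice that both `info()` and `describe()` report 350 `Density` values against 357 rows, which means some values are missing. One way to count missing values per column is `isna().sum()`. The sketch below uses a tiny made-up DataFrame rather than the real file, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows; np.nan marks a missing Density value
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "Population": [100, 200, 300],
    "Density": [10.0, np.nan, 30.0],
})

# Count missing values in each column
missing = df.isna().sum()
print(missing)

# describe() skips missing values, so its count covers non-missing rows only
print(df["Density"].describe()["count"])
```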


# DRILL

Practice reading in and exploring the two files in the `practice` folder.

1. `largest-us-cities.csv`: Find the data types and dimensions of this DataFrame. Also print out the first five rows.
2. `chicago-big-ten.xlsx`: The worksheet you're interested in is called `alumni`. Get the column names and run the descriptive statistics.

Congratulations on reading and exploring data in `pandas`! In the following sections we'll look at manipulating DataFrames and then visualizing the results. 