# 4. Python data science modules

The core packages for data science in Python are

* numpy - arrays
* pandas - data tables
* matplotlib and seaborn - plotting




## Numpy

Numpy is a Python package for representing **array data**, and comes with a large library of tools and mathematical functions that operate efficiently on arrays.

If you're familiar with more math-oriented programming languages like R or MATLAB, numpy brings much of the built-in math and data functionality from those languages into Python.

Numpy is one of the most widely used Python packages for data science, and is one of the [most-downloaded](https://pypistats.org/top) Python packages overall. It's so useful and reliable that much of the mathematical functionality of the other packages covered in this module (pandas, seaborn, matplotlib) is provided by numpy under the hood.

Because numpy is used a lot, it's convention to import it with the `np` abbreviation:

In [None]:
import numpy as np


### Why numpy?

A numpy array is similar to a Python list: they can both serve as containers for numbers.

In [None]:
python_list = [0, 2, 4, 6]
print(python_list)

In [None]:
numpy_array = np.array([0, 2, 4, 6])
print(numpy_array)

So why use numpy instead of lists?

* Speed
    * Although numpy is a Python package, most of the functionality is written in fast C or Fortran code.
* Memory efficiency
    * Numpy stores numbers more compactly than a Python list does, so you can work on larger datasets.
* Functionality
    * Numpy comes with a huge range of modules with fast and thoroughly validated algorithms, from interpolation to Fourier transforms.
* Manipulation syntax
    * Numpy's syntax makes it clear and easy to perform common array operations, like slicing, filtering, and summarization.
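
As a small illustration of the syntax point, the sketch below compares the same operations on a list and on an array. The variable names (`doubled_list`, `large_values`, and so on) are made up for this example.

```python
import numpy as np

python_list = [0, 2, 4, 6]
numpy_array = np.array(python_list)

# Multiplying a list repeats it; multiplying an array scales every element.
doubled_list = python_list * 2
doubled_array = numpy_array * 2

# Filtering with a condition and summarizing are one-liners on an array.
large_values = numpy_array[numpy_array > 2]
total = numpy_array.sum()

print(doubled_list)
print(doubled_array)
print(large_values)
print(total)
```

With a plain list, the element-wise versions of these operations would need a loop or a list comprehension.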


But there are some use cases where lists make more sense:

* Storing different kinds of data together
    * Numpy arrays are homogeneous: all the elements must be the same type.
* Working with non-numerical data
    * Some numpy functionality works with strings and other types, but performance can suffer
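
To see what "homogeneous" means in practice, the sketch below builds arrays from mixed inputs and inspects the resulting `dtype`. The variable names are illustrative only.

```python
import numpy as np

# Mixing integers and floats: every element is stored as a float.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# Mixing numbers and strings: every element is converted to a string.
mixed_types = np.array([1, "two", 3.0])
print(mixed_types.dtype.kind)  # "U" means unicode string
print(mixed_types)
```

A Python list would have kept each element's original type, which is why lists are the better fit for mixed data.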




### Array slicing and indexing


One way to create an array is from a Python sequence like a list


In [None]:
days_per_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
days_per_month

Like the original Python list, arrays can be sliced

In [None]:
days_per_month[0:3]

and individual elements can be indexed out

In [None]:
print(days_per_month[1])
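
A few more slicing patterns, sketched on the same array. Negative indices and steps behave as they do for lists, but one difference is worth knowing: a basic array slice is a view onto the original data, not a copy. The variable names here are made up for the example.

```python
import numpy as np

days_per_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

last_month = days_per_month[-1]     # negative indices count from the end
every_third = days_per_month[::3]   # start:stop:step, as with lists
last_quarter = days_per_month[-3:]  # the final three months

# Writing to a slice changes the original array, because the slice is a view.
first_quarter = days_per_month[0:3]
first_quarter[1] = 29               # pretend it's a leap year
print(days_per_month[1])
```

If you need an independent copy of a slice, call `.copy()` on it.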

Two-dimensional (and higher-dimensional; numpy supports many more dimensions than you are likely to need) arrays can be created from nested Python sequences:



In [None]:
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
array_2d

You index a 2D array using the same notation: first your row slicing, then a comma `,`, then your column slicing.

For example, the first element of the second row:

In [None]:
print(array_2d[1, 0])

or the top three values of the last column

In [None]:
print(array_2d[0:3, -1])

Note how our column has lost its "verticalness": once we've sliced it out, it's just a regular 1D array.
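
If you want the result to stay two-dimensional, one option is to select the column with a slice instead of a single integer. The sketch below compares the two; `flat_column` and `kept_column` are names invented for this example.

```python
import numpy as np

array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

flat_column = array_2d[0:3, -1]    # an integer index drops the column axis
kept_column = array_2d[0:3, -1:]   # a slice keeps it

print(flat_column.shape)  # (3,)
print(kept_column.shape)  # (3, 1)
print(kept_column)
```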

### Creating arrays

As well as converting Python lists to arrays, numpy can create its own arrays!

You can create an array that's filled with zeros

In [None]:
np.zeros(5)

or with ones, passing the shape as a tuple (remember numpy uses row, column ordering)

In [None]:
np.ones((2, 5))
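
A short sketch of a few variations on these functions. The `dtype` argument and `np.full` are not covered in the text above; they are shown here as optional extras.

```python
import numpy as np

zeros_1d = np.zeros(5)               # a single number gives a 1D array
ones_2d = np.ones((2, 5))            # a tuple gives (rows, columns)
int_zeros = np.zeros((2, 3), dtype=int)  # default dtype is float; this asks for integers
sevens = np.full((2, 2), 7)          # fill with any constant value

print(zeros_1d.shape, zeros_1d.dtype)
print(ones_2d.shape)
print(int_zeros)
print(sevens)
```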

Numpy has its own version of the Python `range` function:

In [None]:
np.arange(2, 9, 2)

and a related `linspace` function that creates an array of `num` evenly spaced elements between a start and a stop value, both included by default

In [None]:
np.linspace(0, 10, num=5)
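
The difference between the two functions is easiest to see side by side. In this sketch, `arange` takes a step size and excludes the stop value, while `linspace` takes a number of points and includes the stop value unless `endpoint=False` is passed.

```python
import numpy as np

stepped = np.arange(2, 9, 2)                           # step of 2, stop excluded
spaced = np.linspace(0, 10, num=5)                     # 5 points, stop included
half_open = np.linspace(0, 10, num=5, endpoint=False)  # 5 points, stop excluded

print(stepped)
print(spaced)
print(half_open)
```

As a rule of thumb, `arange` suits cases where you know the step, and `linspace` suits cases where you know how many points you want.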

### Array attributes

The `shape` attribute gives the number of rows and columns (in that order!) of an array

In [None]:
array_2d.shape
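
Alongside `shape`, a few other attributes are often useful when inspecting an array. This sketch shows `ndim`, `size`, and `dtype` on the same 2D array; the exact integer `dtype` printed depends on your platform.

```python
import numpy as np

array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(array_2d.shape)  # (rows, columns)
print(array_2d.ndim)   # number of dimensions
print(array_2d.size)   # total number of elements
print(array_2d.dtype)  # type shared by every element
```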



1. Standard library  
2. Numpy  
   1. Why numpy?   
   2. Creating Arrays  
   3. Array Dimensions  
   4. Array Operations   
   5. Slicing, Indexing, and Broadcasting  
   6. Dot product, cross product, matrix multiplication  
   7. Exporting and loading arrays   
3. Pandas  
   1. Dataframes  
   2. DataFrame structure  
      1. Columns  
      2. Index, datetime index, datetime module   
   3. Loading dataframes from .csv and .xls files   
      1. Dealing with messy data  
      2. Example dataset  
      3. Cleaning real messy data  
   4. Selecting columns  
   5. Filtering by conditionals  
   6. Helpful dataframe functions   
      1. Convert a dict to a dataframe   
   7. Advanced dataframe topics  
      1. Multiindex   
      2. .apply   
      3. .groupby   
4. Matplotlib  
   1. Line plot  
   2. Scatterplot  
   3. Plotting 2d arrays with imshow  
   4. Formatting plots   
      1. Title  
      2. Axis labels  
      3. Legend  
5. Seaborn  
   1. Relplot  
   2. Distplot   
   3. Catplot   
6. Practical Example \- scatterplot and linear regression   
   1. Load two datasets  
   2. Create pandas dataframe with each as a column  
   3. Do a linear regression between two columns  
   4. Plot scatterplot and linear regression using matplotlib  
   5. Add axis labels, legend, title, regression equation   
   6. Save plot to .png 
