# Week 3 - Data handling: pandas and numpy

The Python modules `pandas` and `numpy` are useful tools for handling datasets and applying basic operations to them.

Some of the things we learnt in weeks 1 and 2 using native Python (e.g. accessing, working with and writing data files) can often be achieved more easily using `pandas`. This module offers data structures and operations for manipulating different types of datasets; see the [documentation](https://pandas.pydata.org/).


### Installing the modules

The full Anaconda distribution normally ships with `pandas` and `numpy` already installed. If they are missing from your installation (for example, if you installed Miniconda instead), open the Anaconda Prompt or your Windows command line and run the following command: `conda install pandas`. Because `numpy` is a dependency of `pandas`, it is installed automatically along with it.


### Loading modules

Once they are installed, we can import them using the aliases `pd` and `np` as follows:

In [None]:
import pandas as pd
import numpy as np
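To confirm that both modules were imported correctly, one option is to print their version strings. This is only a quick sanity check; the versions you see will depend on your own installation.

```python
import pandas as pd
import numpy as np

# Each module exposes its version as a string in the __version__ attribute
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```

If either import fails with a `ModuleNotFoundError`, the module is not installed in the environment your notebook is using.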

### Reading datasets with `pandas`

As in weeks 1 and 2, we are going to use the METABRIC dataset `metabric_clinical_and_expression_data.csv`, which contains information about breast cancer patients.

Pandas can import data from various formats such as CSV, XLS, JSON and SQL. To read a CSV file, use the function `read_csv()`:

In [None]:
metabric = pd.read_csv("../data/metabric_clinical_and_expression_data.csv")
metabric

If you forget to include `../data/` above, or if you include it but your copy of the file is saved somewhere else, you will get an error that ends with a line like this (the exact wording depends on your version of pandas): `FileNotFoundError: File b'metabric_clinical_and_expression_data.csv' does not exist`
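If you are unsure whether the path is right, a minimal sketch like the one below checks for the file before trying to read it. It uses the standard-library `pathlib` module and assumes the same relative path as the example above; the variable name `data_path` is just an illustrative choice.

```python
from pathlib import Path

import pandas as pd

# Path taken from the example above; adjust it if your copy is stored elsewhere
data_path = Path("../data/metabric_clinical_and_expression_data.csv")

if data_path.exists():
    metabric = pd.read_csv(data_path)
    print(f"Loaded {len(metabric)} rows")
else:
    # Show the absolute location that was tried, which helps spot path mistakes
    print(f"File not found: {data_path.resolve()}")
    print(f"Current working directory: {Path.cwd()}")
```

Relative paths are resolved from the current working directory, so printing it is often the quickest way to see why a file cannot be found.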

Generally, the columns of a `DataFrame` represent the observed variables and the rows represent the observations.

The pandas `DataFrame` object borrows many features from R's `data.frame`. Both are 2-dimensional tables whose columns can contain different data types (e.g. boolean, integer, float, categorical/factor). Both the rows and columns are indexed, and can be referred to by number or name.

Looking at the column on the far left, you can see that the DataFrame `metabric` is indexed using the 0-based notation common in Python.
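The points above about mixed column types, access by name or by number, and 0-based indexing can be sketched on a small made-up table. The column names and values below are invented for illustration and are not taken from the METABRIC dataset.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration
toy = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3"],
        "age": [61.5, 48.2, 70.1],
        "nodes": [0, 2, 1],
        "survived": [True, False, True],
    }
)

# Each column has its own data type
print(toy.dtypes)

# The default row index starts at 0
print(list(toy.index))

# Select one value by position (row 0, column 1) with iloc
first_age_by_position = toy.iloc[0, 1]

# Select the same value by label (row label 0, column "age") with loc
first_age_by_label = toy.loc[0, "age"]

print(first_age_by_position, first_age_by_label)
```

Here `iloc` refers to rows and columns by number, while `loc` refers to them by label. Because the default row labels are the integers starting from 0, both calls return the same value in this example.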

Note that the `read_csv()` function is not limited to reading CSV files. For example, you can also read tab-separated values (TSV) files by adding the argument `sep='\t'`.
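As a minimal sketch of the `sep` argument, the example below reads tab-separated text from an in-memory string rather than a file on disk, so it runs without any extra data files. The content of `tsv_text` is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated content held in memory instead of a file
tsv_text = "patient_id\tage\nP1\t61.5\nP2\t48.2\n"

# sep="\t" tells read_csv to split fields on tabs rather than commas
tsv_table = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_table)
```

With a real TSV file you would pass its path as the first argument instead of the `io.StringIO` object.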