# importing, manipulating, and representing data 

The basis of any statistical analysis is the underlying data.

A data-set is typically presented as a file containing information formatted as a table:
 * each line corresponds to an observation ( individual, sample, ... )
 * each column corresponds to a measured variable ( height, sex, gene expression, ... )


To read data files and manipulate the data, we will rely on [pandas](https://pandas.pydata.org/).
Pandas is a "high-level" module, designed for statistics/exploratory analysis.
A great strength of pandas is its **DataFrame**, which emulates much of the convenient behavior and syntax of its namesake in the **R** language.


To graphically represent the data, we will rely on [seaborn](https://seaborn.pydata.org/index.html).
Seaborn is designed to work hand-in-hand with pandas DataFrames to produce **efficient data representations** from fairly simple commands. Its website offers very good tutorials as well as a gallery (with associated code) that can get you started quickly.


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
print('OK')

OK


# ToC <a id='toc'></a>


1. [Reading the data](#reading)

    1.1. [the basics](#reading.1)

    1.2. [header or not header, that is the question](#reading.2)

    1.3. [setting up the row index](#reading.3)

    1.4. [other options](#reading.4)

    1.5. [more formats](#reading.5)

2. [data manipulation](#manip)

    2.1. [first contact with the data](#manip.1)

    2.2. [accessing specific parts of the data - rows and columns](#manip.2)

    2.3. [accessing specific parts of the data - selection](#manip.3)

    2.4. [Operations on columns](#manip.4)
    
    2.5. [adding/removing and combining columns](#manip.5)


[back to toc](#toc)

## 1. Reading the data <a id='reading'></a> 
[back to toc](#toc)
### 1.1. the basics <a id='reading.1'></a> 

What is the file name? Its location?
What is the separator between fields?


`pd.read_table` is a general-purpose function to read tables. Aside from the name of the file to read, here are some useful parameters:
* `sep` : separator between columns (by default '\t')
* `header` : row number(s) to use as the column names. By default the first line is used as the header. Use `header=None` if the file does not contain column names.
* `skiprows` : line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
* `true_values`/`false_values` : lists of values to interpret as `True`/`False`.

Of course you can learn (much) more using `help(pd.read_table)`.
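
As a minimal sketch of how `sep`, `header` and `skiprows` interact, the example below parses a small in-memory table rather than a file. The table content, the variable names and the use of `io.StringIO` are assumptions made for illustration; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, with one comment line before the header
raw = "# exported from some instrument\nname;height\nalice;1.70\nbob;1.82\n"

# skiprows=1 skips the comment line; the next line is then used as the header
df_demo = pd.read_table(io.StringIO(raw), sep=";", skiprows=1)
print(df_demo)

# header=None: no line is used as a header, columns are simply numbered
df_nohead = pd.read_table(io.StringIO(raw), sep=";", skiprows=2, header=None)
print(df_nohead)
```

Both calls read the same two data rows; only the column labels differ.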



Let's try to load the `data/titanic.csv` file. As its name suggests, this table contains data about the ill-fated [Titanic](https://en.wikipedia.org/wiki/Titanic) passengers, travelling from England to New York in April 1912.

The data file is named `"titanic.csv"` and, as its extension suggests, it contains **C**omma-**S**eparated **V**alues.


In [15]:
import pandas as pd

df = pd.read_table( "data/titanic.csv" ) 
#try to see what happens when sep has a different value

df.head() # this returns the first 5 rows of the table

Unnamed: 0,"Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked"
0,"Bjornstrom-Steffansson Mr. Mauritz Hakan,male,..."
1,"Coleff Mr. Peju,male,36,3,0,0,7.5,S"
2,"Laroche Miss. Simonne Marie Anne Andree,female..."
3,"Smith Miss. Marion Elsie,female,40,2,1,0,13,S"
4,"Dooley Mr. Patrick,male,32,3,0,0,7.75,Q"


It does not look so great...

**micro-exercise** : try to fix the cell just above by playing with the option(s) of `pd.read_table`


In [3]:
help(pd.read_table)

Help on function read_table in module pandas.io.parsers:

read_table(filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], sep=<object object at 0x7feb206fe210>, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_


<br>

<br>

So you have seen that by default, `read_table()` expects the input data to be **tab-delimited**. Since this is not the case for the `titanic.csv` file, each line was treated as a single field (column), thus creating a DataFrame with a single column.

As implied by its `.csv` extension (for "comma-separated values"), the `titanic.csv` file contains **comma-delimited** values. To load a CSV file, we can either:
* Specify the separator value in `read_table(sep=",")`.
* Use `read_csv()`, a function that will use comma as separator by default.

In [16]:
import pandas as pd

df = pd.read_table( "data/titanic.csv"  ,sep = ',' ) 
# alternatively : df = pd.read_csv( "data/titanic.csv" )   (comma is the default separator of read_csv)

df.head() # this returns the first 5 rows of the table

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75,Q


[back to toc](#toc)
### 1.2. header or not header, that is the question <a id='reading.2'></a>


Another important aspect of reading data is whether your dataset has a header or not.
By default, `pd.read_table` will expect the first line to be a header, unless you either :
 * use the argument `header=None` 
 * specify column names using the `names` argument


In [17]:
df = pd.read_table( "data/titanic_no_header.csv"  ,sep = ',' ) 
df.head(n=3) 

Unnamed: 0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28,1,1.1,0,26.55,S
0,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
1,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
2,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S


Notice how the **first entry was set as column names**... that is not ideal.

Let's correct this:

In [18]:
df = pd.read_table( "data/titanic_no_header.csv"  ,sep = ',' , header = None) 
df.head(n=3) 

Unnamed: 0,0,1,2,3,4,5,6,7
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C


Much better! 

Let's go one step further and assign our own column names :

In [19]:
df = pd.read_table( "data/titanic_no_header.csv"  ,sep = ',' , 
                   names = ['name','column2','age','column4','blip','bloop','spam','eggs']) 
# as you can see, we can choose our own names, whether they make sense or not
df.head(n=3) 

Unnamed: 0,name,column2,age,column4,blip,bloop,spam,eggs
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C


[back to toc](#toc)
### 1.3. setting up the row index <a id='reading.3'></a>

Now that we have set up column names, let's see how to set up row names, called the **index**.

> Not all datasets need an index. Oftentimes the default row numbering is enough.

Again, we have several options at our disposal.

**1. the data file has one fewer column name than data columns**


In [28]:
!head -n 4 data/titanic_implicit_index.csv

Sex,Age,Pclass,Survived,Family,Fare,Embarked
Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28,1,1,0,26.55,S
Coleff Mr. Peju,male,36,3,0,0,7.5,S
Laroche Miss. Simonne Marie Anne Andree,female,3,2,1,1,41.58,C


In [26]:
df = pd.read_table( "data/titanic_implicit_index.csv"  ,sep = ',' ) 
df.head(n=3) 

Unnamed: 0,Sex,Age,Pclass,Survived,Family,Fare,Embarked
Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C


In [22]:
df.index[:5] ## access the index directly

Index(['Bjornstrom-Steffansson Mr. Mauritz Hakan', 'Coleff Mr. Peju',
       'Laroche Miss. Simonne Marie Anne Andree', 'Smith Miss. Marion Elsie',
       'Dooley Mr. Patrick'],
      dtype='object')

**In that case the first, nameless, column is used as index**

**2. we specify the index using `index_col`**

In [31]:
df = pd.read_table( "data/titanic.csv"  ,sep = ',' , index_col = 0 )
# alternatively , specify the index column by name : index_col = 'Name'
df.head(n=3) 

Unnamed: 0_level_0,Sex,Age,Pclass,Survived,Family,Fare,Embarked
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C


> Note that pandas also has a system of multiple, hierarchical indexing (`MultiIndex`). This is, however, a much more specialized and advanced feature.

[back to toc](#toc)

### 1.4. other options <a id='reading.4'></a>

`pd.read_table` has a vast array of options.
We cannot go through all of them, but here are a few which may be of interest to you:

* `true_values`/`false_values` : each takes a list. A must if you have columns encoded with "yes"/"no" labels.
* `na_values` : takes a list. Ideal when your NAs are encoded as something unusual (e.g., `.`,` `,`-9999`,...)
* `parse_dates`/`infer_datetime_format`/`date_parser` : options to help you handle date parsing, which can otherwise be a nightmare (note that pandas 2.x deprecates the last two in favour of `date_format`). [more on this](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
* `compression` : your data is in a compressed file (zip, gzip, ...)? Not a problem!
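
A minimal sketch of `true_values`, `false_values` and `na_values` in action, on a small hypothetical table held in memory (the content and the `-9999` missing-value code are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical table: "yes"/"no" flags, and missing weights encoded as -9999
raw = "sample,treated,weight\ns1,yes,12.5\ns2,no,-9999\ns3,yes,14.0\n"

df_opts = pd.read_table(
    io.StringIO(raw),
    sep=",",
    true_values=["yes"],    # read as True
    false_values=["no"],    # read as False
    na_values=["-9999"],    # read as missing value (NaN)
)
print(df_opts)
print(df_opts.dtypes)
```

The `treated` column is then read as booleans, and the `-9999` weight becomes a missing value instead of a misleading number.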



[back to toc](#toc)

### 1.5. more formats <a id='reading.5'></a>

As you might expect, pandas is not limited to text, csv/tsv-like files.

* `pd.read_excel()`
* `pd.read_json()`
* `pd.read_sql()` 
* ... see [here for an exhaustive list of pandas reader and writer functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).



In [39]:
genbank_df = pd.read_json( 'data/genbank.sub.ndjson' , lines = True )
genbank_df.head()

Unnamed: 0,genbank_accession,genbank_accession_rev,database,strain,region,location,collected,submitted,length,host,isolation_source,biosample_accession,title,authors,publications,sequence
0,MW553299,MW553299.1,GenBank,SARS-CoV-2/human/ARG/Cordoba-189-251/2020,South America,Argentina,2020-05-16,2021-02-01T00:00:00Z,29719,Homo sapiens,oronasopharynx,,Severe acute respiratory syndrome coronavirus ...,Direct Submission,,GATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCG...
1,MW553294,MW553294.1,GenBank,SARS-CoV-2/human/ARG/Cordoba-2635-202/2020,South America,Argentina,2020-06-04,2021-02-01T00:00:00Z,29723,Homo sapiens,oronasopharynx,,Severe acute respiratory syndrome coronavirus ...,Direct Submission,,GATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCG...
2,MW553295,MW553295.1,GenBank,SARS-CoV-2/human/ARG/Cordoba-2842-202/2020,South America,Argentina,2020-06-04,2021-02-01T00:00:00Z,29723,Homo sapiens,oronasopharynx,,Severe acute respiratory syndrome coronavirus ...,Direct Submission,,GATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCG...
3,MW553296,MW553296.1,GenBank,SARS-CoV-2/human/ARG/Cordoba-1083-6/2020,South America,Argentina,2020-06-04,2021-02-01T00:00:00Z,29717,Homo sapiens,oronasopharynx,,Severe acute respiratory syndrome coronavirus ...,Direct Submission,,TCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCAT...
4,MW553297,MW553297.1,GenBank,SARS-CoV-2/human/ARG/Cordoba-11419-61/2020,South America,Argentina,2020-06-04,2021-02-01T00:00:00Z,29724,Homo sapiens,oronasopharynx,,Severe acute respiratory syndrome coronavirus ...,Direct Submission,,GATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCG...



<br>

**micro-exercise :** read the file `data/pbmc_data.countMatrix.50.txt.zip` as a DataFrame. Determine which is the separator, and decide whether there is a header and/or an index column.

In [55]:
df_sc = pd.read_table('data/pbmc_data.countMatrix.50.txt.zip',sep=' ', index_col = 0)
df_sc.head()

Unnamed: 0_level_0,AAACATACAACCAC,AAACATTGAGCTAC,AAACATTGATCAGC,AAACCGTGCTTCCG,AAACCGTGTATGCG,AAACGCACTGGTAC,AAACGCTGACCAGT,AAACGCTGGTTCTT,AAACGCTGTAGCCA,AAACGCTGTTTCTG,...,AAATTCGAGGAGTG,AAATTCGATTCTCA,AAATTGACACGACT,AAATTGACTCGCTC,AACAAACTCATTTC,AACAAACTTTCGTT,AACAATACGACGAG,AACACGTGCAGAGG,AACACGTGGAAAGT,AACACGTGGAACCT
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MIR1302-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
FAM138A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
OR4F5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
RP11-34P13.7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
RP11-34P13.8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0



<br>

[back to toc](#toc)

<br>
<br>


Ok, so now you know (almost) everything there is to know about getting your data from a file to a `DataFrame`.

Now let's see what we can actually do with these!



## 2. data manipulation <a id='manip'></a>


### 2.1 first contact with the data <a id='manip.1'></a>

Gathering basic information about the data-set is fairly easy, but first, let's re-load the titanic data :

In [140]:
df = pd.read_table('data/titanic.csv',sep=',')

`df.shape`: returns a tuple with the numbers of rows and columns: `(row_count, col_count)`.

In [58]:
numberRows , numberCols = df.shape
print('rows:',numberRows , 'columns:',numberCols)

rows: 891 columns: 8

In [59]:
print('column names')
df.columns

column names


Index(['Name', 'Sex', 'Age', 'Pclass', 'Survived', 'Family', 'Fare',
       'Embarked'],
      dtype='object')

In [60]:
print('index')
df.index

index


RangeIndex(start=0, stop=891, step=1)

> Note how here the index is number-based

<br>

The `df.columns` and `df.index` attributes can also be used to set new values for column names and index labels.

In [62]:
df.columns = [x.upper() for x in df.columns]
df.index = ['passenger'+str(i) for i in df.index]
df.head()

Unnamed: 0,NAME,SEX,AGE,PCLASS,SURVIVED,FAMILY,FARE,EMBARKED
passenger0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
passenger1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
passenger2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
passenger3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
passenger4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75,Q


In [65]:
# Let's reset our changes in index and column names:
df.columns = df.columns.str.capitalize()
df.index = range(0, df.shape[0])
df

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.50,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.00,S
4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75,Q
...,...,...,...,...,...,...,...,...
886,Ryerson Miss. Susan Parker,female,21.0,1,1,2,262.38,C
887,Hogeboom Mrs. John C (Anna Andrews),female,51.0,1,1,1,77.96,S
888,Vanden Steen Mr. Leo Peter,male,28.0,3,0,0,9.50,S
889,Baclini Miss. Marie Catherine,female,5.0,3,1,2,19.26,C


> Note how we are able to apply a `str` method to all the column names at once. That is a very powerful feature, which we'll discuss later.

Each column is associated with a type, which controls the operations you may perform on it :

In [66]:
print( "columns types:\n" ,df.dtypes )# lists the types

columns types:
 Name         object
Sex          object
Age         float64
Pclass        int64
Survived      int64
Family        int64
Fare        float64
Embarked     object
dtype: object


Here we can see that values were interpreted as either 
 * `object` : catch all for text, intermixed or not with numbers
 * `float64` : float
 * `int64` : integer
 

**Do you see something that should be changed here?**
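
A minimal sketch of how a column's type can be changed with `astype()`, on a small hypothetical table that mimics a few of the titanic columns. Treating `Sex` as a category and `Survived` as a boolean is one possible choice, an assumption rather than the only valid answer to the question above.

```python
import pandas as pd

# Small hypothetical table mimicking a few titanic columns
df_types = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [1, 3, 2],
    "Survived": [1, 0, 1],
})

# astype() returns a converted version of the column, which we assign back
df_types["Sex"] = df_types["Sex"].astype("category")
df_types["Survived"] = df_types["Survived"].astype(bool)
print(df_types.dtypes)
```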



[back to toc](#toc)

### 2.2 accessing specific parts of the data - rows and columns <a id='manip.2'></a>

One can access a column just by using `df[<column name>]` :

In [70]:
df['Sex']

0        male
1        male
2      female
3      female
4        male
        ...  
886    female
887    female
888      male
889    female
890      male
Name: Sex, Length: 891, dtype: object

> See how `pandas` only prints the first 5 and last 5 lines of the column, along with some useful info, to avoid clogging your screen.

Alternatively, you can just use `df.<column name>`

In [71]:
df.Sex

0        male
1        male
2      female
3      female
4        male
        ...  
886    female
887    female
888      male
889    female
890      male
Name: Sex, Length: 891, dtype: object

<br>

**Subsetting a DataFrame with the `loc[]` and `iloc[]` indexers**

A very common operation to perform on DataFrames is to create a subset by selecting certain rows and/or columns.  
There are 2 methods in pandas to perform a selection on a DataFrame (here `df`):
* **position based:** using `df.iloc[<row selection>, <column selection>]`
* **index/label based:**  using `df.loc[<row selection>, <column selection>]`

![image.png](img/pandas_position_vs_index_selection.png)

To select all rows/or columns, the symbol `:` can be used as row or column selection. It works with both `.loc[]` and `.iloc[]`:
* `df.loc[<row selection>,:]`: select all columns.
* `df.loc[:, <column selection>]`: select all rows.

When selecting on rows only (i.e. select all columns), the `df.loc[<row selection>, ]` and `df.loc[<row selection>]` syntaxes are also possible (i.e. the `:` is not compulsory in that case).
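
When the index is made of consecutive integers, labels and positions coincide, which can hide the difference between the two indexers. A minimal sketch with text labels (the small DataFrame and its labels are hypothetical) makes the distinction more visible:

```python
import pandas as pd

# Hypothetical DataFrame with text labels as index
df_small = pd.DataFrame(
    {"Age": [28.0, 36.0, 3.0], "Pclass": [1, 3, 2]},
    index=["passenger_a", "passenger_b", "passenger_c"],
)

# label-based: rows are named, and the end of the slice is included
by_label = df_small.loc["passenger_a":"passenger_b", "Age"]

# position-based: rows and columns are counted from 0, and the end of the slice is excluded
by_position = df_small.iloc[0:2, 0]

print(by_label)
print(by_position)
```

Both selections return the same two values; only the way of designating them differs.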

<br>

**Common pitfall with slicing :** `loc[]` includes the endpoint, `iloc[]` does not:

In [74]:
df.loc[ 0:3 , : ]  # this selects the first 4 rows.

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S


In [73]:
df.iloc[ 0:3 , : ]   # this selects the first 3 rows.

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C


Let's apply all this : 

In [75]:
df.loc[ 0:3 , : ] ### first 4 rows (labels 0 to 3, endpoint included), all columns

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S


In [77]:
df.loc[ : , ['Sex' , 'Name' , 'Name' , 'Age'] ] ### all rows, columns 'Sex', 'Name' (twice) and 'Age'

Unnamed: 0,Sex,Name,Name.1,Age
0,male,Bjornstrom-Steffansson Mr. Mauritz Hakan,Bjornstrom-Steffansson Mr. Mauritz Hakan,28.0
1,male,Coleff Mr. Peju,Coleff Mr. Peju,36.0
2,female,Laroche Miss. Simonne Marie Anne Andree,Laroche Miss. Simonne Marie Anne Andree,3.0
3,female,Smith Miss. Marion Elsie,Smith Miss. Marion Elsie,40.0
4,male,Dooley Mr. Patrick,Dooley Mr. Patrick,32.0
...,...,...,...,...
886,female,Ryerson Miss. Susan Parker,Ryerson Miss. Susan Parker,21.0
887,female,Hogeboom Mrs. John C (Anna Andrews),Hogeboom Mrs. John C (Anna Andrews),51.0
888,male,Vanden Steen Mr. Leo Peter,Vanden Steen Mr. Leo Peter,28.0
889,female,Baclini Miss. Marie Catherine,Baclini Miss. Marie Catherine,5.0


> Note : I am free to select a column several times, in whichever order I wish

In [81]:
df.loc[ : , 'Name':'Fare' ] ### all rows, column 'Name' to Column "Fare"

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55
1,Coleff Mr. Peju,male,36.0,3,0,0,7.50
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.00
4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75
...,...,...,...,...,...,...,...
886,Ryerson Miss. Susan Parker,female,21.0,1,1,2,262.38
887,Hogeboom Mrs. John C (Anna Andrews),female,51.0,1,1,1,77.96
888,Vanden Steen Mr. Leo Peter,male,28.0,3,0,0,9.50
889,Baclini Miss. Marie Catherine,female,5.0,3,1,2,19.26


In [83]:
df.iloc[0:2, [0,2,3]]                   # Select the first 2 rows, and columns 0,2, and 3

Unnamed: 0,Name,Age,Pclass
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,28.0,1
1,Coleff Mr. Peju,36.0,3


In [84]:
df.iloc[0, [0,2,3]]                   # Select the first row, and columns 0,2, and 3

Name      Bjornstrom-Steffansson Mr. Mauritz Hakan
Age                                           28.0
Pclass                                           1
Name: 0, dtype: object

**BTW, what do I get when I select a single row/column ?**

You may have noticed that it does not get represented as a list, so what is it?

In [87]:
# Select a single row:
row_4 = df.iloc[3,]
row_4b = df.loc[3,]
print(type(row_4))
print(type(row_4b))

# Select a single column. Note that when selecting by columns only, ":" must be used to indicate
# that all rows should be selected.
col_age_a = df.loc[:,"Age"]
col_age_b = df.iloc[:,2]       # "Age" is the third column, i.e. position 2
col_age_c = df["Age"]          # When selecting based on columns only, using this syntax is simpler.
print(type(col_age_a))
print(type(col_age_b))


<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


These are `Series` objects: the 1-dimensional equivalent of the `DataFrame`.

Their elements can be accessed in quite a similar way:

In [91]:
print( row_4 )
print( '---')
print( row_4.iloc[0] ) # by position
print( '---')
print( row_4.Age ) # by name 

Name        Smith Miss. Marion Elsie
Sex                           female
Age                             40.0
Pclass                             2
Survived                           1
Family                             0
Fare                            13.0
Embarked                           S
Name: 3, dtype: object
---
Smith Miss. Marion Elsie
---
40.0


[back to toc](#toc)

### 2.3 accessing specific parts of the data - selection <a id='manip.3'></a>

Another, powerful, way of accessing specific part of the data is by defining a mask, which will **filter the data through a particular condition**.


In [102]:
maskMale = df[ 'Sex' ] == 'male'  
maskMale

0       True
1       True
2      False
3      False
4       True
       ...  
886    False
887    False
888     True
889    False
890     True
Name: Sex, Length: 891, dtype: bool

The mask (in fact a `pandas.core.series.Series`) is in effect a list of values that are `True` or `False` depending on whether or not they satisfy the defined condition (`Sex` is equal to 'male', here).

A great method of `Series` containing categorical data (such as `True`/`False` values only) is `value_counts()`:


In [96]:
maskMale.value_counts()

True     577
False    314
Name: Sex, dtype: int64

The mask can also be applied to the `DataFrame`.

In [98]:
df.loc[ maskMale , ['Sex','Fare','Survived'] ]

Unnamed: 0,Sex,Fare,Survived
0,male,26.55,1
1,male,7.50,0
4,male,7.75,0
5,male,26.00,0
7,male,8.40,0
...,...,...,...
880,male,27.72,0
882,male,6.50,0
884,male,63.36,1
888,male,9.50,0


> Note that only the 577 rows corresponding to male passengers have been selected.


Masks may be combined to produce more complex selection criteria, using **`&`** (logical and) and **`|`** (logical or).

In [108]:
# male passenger with a fare > 200
mask = (df[ 'Sex' ] == 'male') & ( df.Fare > 200  ) 
df.loc[ mask , ]

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
195,Robbins Mr. Victor,male,,1,0,0,227.53,C
292,Baxter Mr. Quigg Edmond,male,24.0,1,0,0,247.52,C
359,Widener Mr. Harry Elkins,male,27.0,1,0,0,211.5,C
375,Farthing Mr. John,male,,1,0,0,221.78,S
629,Lesurer Mr. Gustave J,male,35.0,1,1,0,512.33,C
720,Cardeza Mr. Thomas Drake Martinez,male,36.0,1,1,0,512.33,C
723,Fortune Mr. Charles Alexander,male,19.0,1,0,3,263.0,S
820,Fortune Mr. Mark,male,64.0,1,0,1,263.0,S


In [107]:
# Select all people that are either < 25 or a woman.
df.loc[(df.Age < 25) | (df.Sex == "female"), :]

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.00,S
6,Goodwin Miss. Lillian Amy,female,16.0,3,0,5,46.90,S
8,Fleming Miss. Margaret,female,,1,1,0,110.88,C
10,Panula Master. Juha Niilo,male,7.0,3,0,4,39.69,S
...,...,...,...,...,...,...,...,...
884,Greenfield Mr. William Bertram,male,23.0,1,1,0,63.36,C
885,Baclini Miss. Eugenie,female,1.0,3,1,2,19.26,C
886,Ryerson Miss. Susan Parker,female,21.0,1,1,2,262.38,C
887,Hogeboom Mrs. John C (Anna Andrews),female,51.0,1,1,1,77.96,S


> **`.iloc[]` does not accept a boolean `Series` (True/False) for row selection**: it works on positions.
  One can work around that by converting the mask to a plain array (`mask.to_numpy()`), or by using the [query method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)
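
A minimal sketch of three ways to apply the same condition, on a small hypothetical DataFrame (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Small hypothetical table
df_mask = pd.DataFrame({"Sex": ["male", "female", "male"], "Fare": [26.55, 41.58, 7.75]})
mask = df_mask["Sex"] == "male"

rows_a = df_mask.loc[mask, :]                # .loc[] accepts the boolean Series directly
rows_b = df_mask.iloc[mask.to_numpy(), :]    # .iloc[] accepts a plain boolean array
rows_c = df_mask.query("Sex == 'male'")      # query() takes the condition as a string

print(rows_a)
```

All three return the same rows; `.loc[]` with a mask is usually the most direct choice.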


**micro-exercise:** Select the fare and name of passengers in first class (`Pclass` 1) who are less than 18 years old.

<br>

[back to toc](#toc)

### 2.4 Operations on columns <a id='manip.4'></a>


pandas DataFrames allow the use of arithmetic operators on columns:

In [112]:
df.Age

0      28.0
1      36.0
2       3.0
3      40.0
4      32.0
       ... 
886    21.0
887    51.0
888    28.0
889     5.0
890    64.0
Name: Age, Length: 891, dtype: float64

In [113]:
df["Age"] = df["Age"] + 1
df.Age

0      29.0
1      37.0
2       4.0
3      41.0
4      33.0
       ... 
886    22.0
887    52.0
888    29.0
889     6.0
890    65.0
Name: Age, Length: 891, dtype: float64

In [114]:
df["Age"] -= 1 # same as df["Age"] = df["Age"] - 1
df.Age

0      28.0
1      36.0
2       3.0
3      40.0
4      32.0
       ... 
886    21.0
887    51.0
888    28.0
889     5.0
890    64.0
Name: Age, Length: 891, dtype: float64

This mechanism becomes quite powerful as we can apply it between or across whole columns.

For instance, consider this data from the 1880 Swiss census :


In [121]:
df_census = pd.read_table('data/swiss_census_1880.csv',sep=',')
df_census.loc[:5,['town name',"Total","Male"]]

Unnamed: 0,town name,Total,Male
0,Aeugst,646,319
1,Affoltern am Albis,2201,1055
2,Bonstetten,771,361
3,Hausen,1363,640
4,Hedingen,907,448
5,Kappel,819,432


The `Total` column gives the total number of registered inhabitants, while the `Male` column gives the number of males.

To get the fraction of the population which is male for each record, we simply write :

In [123]:
df_census.Male / df_census.Total

0       0.493808
1       0.479328
2       0.468223
3       0.469552
4       0.493936
          ...   
3185    0.538012
3186    0.501712
3187    0.541963
3188    0.459606
3189    0.458578
Length: 3190, dtype: float64

That can just be assigned to a **new column as if you were adding a key to a dictionary** :

In [125]:
df_census['Male Fraction'] = df_census.Male / df_census.Total

df_census.loc[:5,['town name',"Total","Male" , 'Male Fraction']]

Unnamed: 0,town name,Total,Male,Male Fraction
0,Aeugst,646,319,0.493808
1,Affoltern am Albis,2201,1055,0.479328
2,Bonstetten,771,361,0.468223
3,Hausen,1363,640,0.469552
4,Hedingen,907,448,0.493936
5,Kappel,819,432,0.527473




Also, these operations may be combined with a selection operation.

This is particularly useful when you want to mark some data as NAs for instance.


In [129]:
## imagine that for some reason the fares of class 3 are not valid. We want to set them to NA
df.head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75,Q


In [132]:
## NA is represented using pd.NA
df.loc[ df.Pclass==3 , 'Fare'] = pd.NA
df.head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
4,Dooley Mr. Patrick,male,32.0,3,0,0,,Q


**micro-exercise:** children under the age of 10 get a special discount of 50% on their fare. Apply this by dividing by 2 the `Fare` of eligible passengers in the `df` DataFrame.

#### interlude : copy or not copy?

What happens if I select part of a DataFrame and modify it? Does the original data stay the same?

In [175]:
df = pd.read_table( "data/titanic.csv"  ,sep = ',' )

In [161]:
df.loc[ df.Sex == 'male' , : ].head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,28.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,36.0,3,0,0,7.5,S
4,Dooley Mr. Patrick,male,32.0,3,0,0,7.75,Q
5,Kantor Mr. Sinai,male,34.0,2,0,1,26.0,S
7,Olsen Mr. Karl Siegwart Andreas,male,42.0,3,0,0,8.4,S


In [162]:
df.loc[ df.Sex == 'male' , 'Age' ] = 999 

In [163]:
df.head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,999.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,999.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
4,Dooley Mr. Patrick,male,999.0,3,0,0,7.75,Q


Well OK, the ages were changed in the main `DataFrame` : assigning through `.loc[]` writes directly into the original data, without making a copy... BUT :

In [164]:
df_maleOnly = df.loc[ df.Sex == 'male' , : ]

In [165]:
# setting the age to 888 in the dataframe of male only
df_maleOnly.Age = 888

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


What's this? We get a warning!

In [166]:
df_maleOnly.head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,888,1,1,0,26.55,S
1,Coleff Mr. Peju,male,888,3,0,0,7.5,S
4,Dooley Mr. Patrick,male,888,3,0,0,7.75,Q
5,Kantor Mr. Sinai,male,888,2,0,1,26.0,S
7,Olsen Mr. Karl Siegwart Andreas,male,888,3,0,0,8.4,S


In [167]:
df.head()

Unnamed: 0,Name,Sex,Age,Pclass,Survived,Family,Fare,Embarked
0,Bjornstrom-Steffansson Mr. Mauritz Hakan,male,999.0,1,1,0,26.55,S
1,Coleff Mr. Peju,male,999.0,3,0,0,7.5,S
2,Laroche Miss. Simonne Marie Anne Andree,female,3.0,2,1,1,41.58,C
3,Smith Miss. Marion Elsie,female,40.0,2,1,0,13.0,S
4,Dooley Mr. Patrick,male,999.0,3,0,0,7.75,Q


Indeed the change made to `df_maleOnly` has not been reflected in `df`.


Sadly, it is not always easy to know whether you get a **view** or a **copy**.

![image.png](img/view_copy.png)


 * view : this still points to the original data 
 * copy : new data. Modifying this leaves the original data untouched

Whether `.loc[]` gives you a view or a copy depends on the kind of selection (a simple slice may return a view, while a boolean mask generally returns a copy), and also on the evaluation order of some of the performed operations:

In [191]:
df_maleOnly = df.loc[ df.Sex == 'male' , : ] 
## this previous line creates a copy or a view, nobody knows.


df_maleOnly.Age=888
## only when we try to set some data does pandas detect something potentially fishy and warns us.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


If we actually wanted a copy, we should make that explicit to pandas :

In [194]:
df_maleOnly = df.loc[ df.Sex == 'male' , : ].copy()
## this previous line creates a copy


df_maleOnly.Age=888
## no warning anymore

Anyway, this issue is quite complex, but as you are likely to encounter this warning at some point, it is better to let the cat out of the bag now.

I recommend [this](https://www.dataquest.io/blog/settingwithcopywarning/) for a better, more in-depth explanation.
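
To summarise the two safe patterns, here is a minimal sketch on a small hypothetical DataFrame (the data and variable names are assumptions for illustration): work on an explicit `.copy()` when the subset should be independent, and assign through `.loc[]` on the original when the original should change.

```python
import pandas as pd

df_orig = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [28.0, 3.0, 32.0]})

# Pattern 1: explicit copy, changes stay in the subset
subset = df_orig.loc[df_orig.Sex == "male", :].copy()
subset["Age"] = 888
ages_after_subset_edit = df_orig["Age"].tolist()   # the original is untouched
print(ages_after_subset_edit)

# Pattern 2: assign through .loc[] on the original DataFrame itself
df_orig.loc[df_orig.Sex == "male", "Age"] = 999
print(df_orig["Age"].tolist())
```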


[back to toc](#toc)

### 2.5 adding/removing and combining columns <a id='manip.5'></a>

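We have already seen how to add a column, by assigning to a new column name :

In [148]:
df = pd.read_table( "data/swiss_census_1880.csv"  ,sep = ',' ) # the columns used below come from the census data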

In [29]:
df['14+ y.o.'] = df['15-59 y.o.'] + df['60+ y.o.']
df['14+ y.o.'].head()

0     429
1    1488
2     505
3    1020
4     652
Name: 14+ y.o., dtype: int64

In [30]:
df.columns

Index(['Year', 'town number', 'town name', 'Total', 'Swiss', 'Foreigner',
       'Male', 'Female', '0-14 y.o.', '15-59 y.o.', '60+ y.o.', 'Reformed',
       'Catholic', 'Other', 'German speakers', 'Franch speakers',
       'Italian speakers', 'Romansche speakers',
       'Non-national tongue speakers', 'district number', 'district name',
       'canton number', 'canton', 'canton name', '14+ y.o.'],
      dtype='object')

Removing columns is about as easy:

In [31]:
df = df.drop(columns='14+ y.o.') # use the 'index' argument to remove rows instead
print("is '14+ y.o.' part of the columns : " , '14+ y.o.' in df.columns)

is '14+ y.o.' part of the columns :  False


In [32]:
df.columns

Index(['Year', 'town number', 'town name', 'Total', 'Swiss', 'Foreigner',
       'Male', 'Female', '0-14 y.o.', '15-59 y.o.', '60+ y.o.', 'Reformed',
       'Catholic', 'Other', 'German speakers', 'Franch speakers',
       'Italian speakers', 'Romansche speakers',
       'Non-national tongue speakers', 'district number', 'district name',
       'canton number', 'canton', 'canton name'],
      dtype='object')
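
As a minimal sketch of `drop()` on both axes, the example below uses a small hypothetical DataFrame holding the first census rows shown earlier. Note that `drop()` returns a new DataFrame and leaves the one it is called on unchanged.

```python
import pandas as pd

# Small table reusing a few census values shown earlier
df_drop = pd.DataFrame({
    "town name": ["Aeugst", "Bonstetten", "Hausen"],
    "Total": [646, 771, 1363],
    "Male": [319, 361, 640],
})

no_cols = df_drop.drop(columns=["Total", "Male"])   # remove several columns at once
no_rows = df_drop.drop(index=[0, 2])                # remove rows by index label

print(no_cols)
print(no_rows)
```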