# Python Lecture 8: basic analysis with pandas dataframes 

Today we start doing basic data analysis using Python and the **pandas** (panel data) package.

We will use two classes of the `pandas` package: `Series` and `DataFrame`.

#### Series

* A pandas `Series` contains a column of values together with an index.
* Instead of the built-in Python data types, `Series` objects use data types that more closely resemble those in C/C++ (for example `float64`):

In [3]:
import pandas as pd

a = pd.Series([1.0, 2.1, 3.2, 5.5])
a

0    1.0
1    2.1
2    3.2
3    5.5
dtype: float64

In [4]:
b = pd.Series([1.9, 2.0, 2.1, 2.3], index=[1, 2, 4, 5])
b

1    1.9
2    2.0
4    2.1
5    2.3
dtype: float64
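
As a small illustration of the two points above, the following sketch prints the dtype that pandas picks for a few example `Series` and looks up a value by index label and by position. The variable names `ints`, `floats` and `mixed` are made up for this example.

```python
import pandas as pd

# Each Series stores its values in one fixed dtype, much like a typed C array
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5])
mixed = pd.Series([1, 2.5])    # the integer is upcast so that all values are floats

print(ints.dtype)      # an integer dtype (int64 on most platforms)
print(floats.dtype)    # float64
print(mixed.dtype)     # float64

# The index gives every value a label; .loc looks up by label, .iloc by position
b = pd.Series([1.9, 2.0, 2.1, 2.3], index=[1, 2, 4, 5])
print(b.loc[4])        # value with label 4 -> 2.1
print(b.iloc[0])       # first value -> 1.9
```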

We can do arithmetic with pandas `Series` as follows:

In [6]:
c = pd.Series([4.0, 3.0, 2.0, 1.0])
print(a + c)
print(c + 2)
print(c ** 2)

0    5.0
1    5.1
2    5.2
3    6.5
dtype: float64
0    6.0
1    5.0
2    4.0
3    3.0
dtype: float64
0    16.0
1     9.0
2     4.0
3     1.0
dtype: float64


Moreover, `Series` support slicing:

In [7]:
c[1::2]

1    3.0
3    1.0
dtype: float64

When adding `Series` objects, the elements are matched according to their index:

In [8]:
print(b)
print(a + b)

1    1.9
2    2.0
4    2.1
5    2.3
dtype: float64
0    NaN
1    4.0
2    5.2
3    NaN
4    NaN
5    NaN
dtype: float64


Here, `NaN` (**N**ot **a** **N**umber) denotes a *missing value*.
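
Missing values can be counted, dropped, or avoided altogether. The sketch below repeats the two `Series` from above and uses `isna`, `dropna` and the `add` method with `fill_value`; treating a missing label as 0 is only one possible choice and depends on what the data means.

```python
import pandas as pd

a = pd.Series([1.0, 2.1, 3.2, 5.5])
b = pd.Series([1.9, 2.0, 2.1, 2.3], index=[1, 2, 4, 5])

total = a + b
print(total.isna().sum())   # number of missing entries
print(total.dropna())       # keeps only the labels present in both Series

# Treat a label that is missing on one side as 0 instead of producing NaN
filled = a.add(b, fill_value=0)
print(filled)
```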

`Series` also has useful methods:

In [9]:
a.sum()

11.8

In [10]:
a.min()

1.0

In [11]:
a.max()

5.5

In [12]:
a.mean()

2.95

#### DataFrame

A `DataFrame` is a block of values with named columns. All values in a column have the same type:

In [13]:
students = pd.DataFrame({'name': ['alex', 'bill', 'chris', 'dave'],
                         'major': ['data science', 'software', 
                                   'software', 'data science'],
                         'score': [1.0, 3.0, 5.0, 2.0]})
students

,major,name,score
0,data science,alex,1.0
1,software,bill,3.0
2,software,chris,5.0
3,data science,dave,2.0


Each column in a `DataFrame` is a `Series`, and all columns share the same index. We can select columns using `.`-notation or by their name as an index:

In [14]:
print(students.name)
print(students['score'])

0     alex
1     bill
2    chris
3     dave
Name: name, dtype: object
0    1.0
1    3.0
2    5.0
3    2.0
Name: score, dtype: float64


Using index notation we can also add new columns:

In [15]:
students['passed'] = students['score'] < 5
students

,major,name,score,passed
0,data science,alex,1.0,True
1,software,bill,3.0,True
2,software,chris,5.0,False
3,data science,dave,2.0,True


We can select rows in a DataFrame by passing a `Series` (or list) of Boolean values; only the rows marked `True` are kept:

In [16]:
passed = students[students.passed]
passed

,major,name,score,passed
0,data science,alex,1.0,True
1,software,bill,3.0,True
3,data science,dave,2.0,True


In [17]:
good_students = students[students.score < 2.5]
good_students

,major,name,score,passed
0,data science,alex,1.0,True
3,data science,dave,2.0,True


In [18]:
students[ [True, False, True, False] ]

,major,name,score,passed
0,data science,alex,1.0,True
2,software,chris,5.0,False
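
Several conditions can be combined into one Boolean selection. The following sketch rebuilds the `students` DataFrame and filters it with `&` (and) and `~` (not); the names `software_passed` and `not_software` are made up for this example.

```python
import pandas as pd

students = pd.DataFrame({'name': ['alex', 'bill', 'chris', 'dave'],
                         'major': ['data science', 'software',
                                   'software', 'data science'],
                         'score': [1.0, 3.0, 5.0, 2.0]})

# & is "and", | is "or", ~ is "not"; each comparison needs its own parentheses
software_passed = students[(students['major'] == 'software')
                           & (students['score'] < 5)]
print(software_passed)

not_software = students[~(students['major'] == 'software')]
print(not_software['name'].tolist())
```

The Python keywords `and`, `or` and `not` do not work here, because they expect a single truth value rather than one value per row.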


#### Exercise:
Calculate the average score over all students:

In [20]:
students['score'].mean()

2.75

Add a column `good` to the DataFrame which is `True` if the student's score is less 
than the average score and `False` otherwise.

In [22]:
students['good'] = students['score'] < students['score'].mean()
students

,major,name,score,passed,good
0,data science,alex,1.0,True,True
1,software,bill,3.0,True,False
2,software,chris,5.0,False,False
3,data science,dave,2.0,True,True


## DataFrames and SQL

`select score from students` = `students['score']`

```
select * from students 
    where score < 2.5
```

```
students[students['score'] < 2.5]
```


```
select major, avg(score) from students group by major
```

### Grouping data

We often need to divide our data into groups and then combine the data within each group.

DataFrames provide us with the `groupby` method to do just this. 

For example, let us calculate the average score for each major:

In [23]:
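# numeric_only=True skips text columns such as `name`;
# pandas 2.0 and later raise an error without it
students.groupby('major').mean(numeric_only=True)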

major,score,passed,good
data science,1.5,1.0,1.0
software,4.0,0.5,0.0


In fact, we have also calculated the fraction of students who passed in each major, because the mean of a Boolean column is the fraction of `True` values.
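
The SQL query `select major, avg(score) from students group by major` returns only the average score. A rough pandas counterpart, sketched here on a rebuilt `students` DataFrame, selects the `score` column after grouping; the name `avg_by_major` is made up for this example.

```python
import pandas as pd

students = pd.DataFrame({'name': ['alex', 'bill', 'chris', 'dave'],
                         'major': ['data science', 'software',
                                   'software', 'data science'],
                         'score': [1.0, 3.0, 5.0, 2.0]})

# select major, avg(score) from students group by major
avg_by_major = students.groupby('major')['score'].mean()
print(avg_by_major)

# reset_index() turns the group labels back into an ordinary column,
# which is closer to the table that SQL would return
print(avg_by_major.reset_index())
```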

### Questions from the last exercise

We can use `pd.read_csv` to open a csv file and read it into a DataFrame:

In [25]:
titanic_df = pd.read_csv('titanic.csv')
titanic_df.head(12)

,survived,pclass,name,sex,age,sibsp,parch,ticket,fare
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55
2,0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55
3,0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55
4,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583
7,0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792
9,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042


The `head` method shows the first rows of a DataFrame (five by default).

Now let's have a look at the problems from last week:

**Calculate how many males and females were on board the Titanic:**

In [26]:
titanic_df.groupby('sex').size()

sex
female    466
male      843
dtype: int64

**Calculate how many passengers were in each passenger class (pclass) and calculate the average fare for each passenger class:**

In [29]:
print(titanic_df.groupby('pclass').size())

# numeric_only=True skips text columns such as `name` and `ticket`
titanic_df.groupby('pclass').mean(numeric_only=True)

pclass
1    323
2    277
3    709
dtype: int64


pclass,survived,age,sibsp,parch,fare
1,0.619195,34.431631,0.436533,0.365325,87.508992
2,0.429603,27.802347,0.393502,0.368231,21.179196
3,0.255289,17.535966,0.568406,0.400564,13.295262


We can solve both problems in one step using the `agg` method. It expects a dictionary whose keys are the column names and whose values are the names of the functions to apply:

In [31]:
titanic_df.groupby('pclass').agg({
    'fare': 'mean',
    'pclass': 'size'
})

pclass,fare,pclass
1,87.508992,323
2,21.179196,277
3,13.295262,709
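
The result above has a column called `pclass` that actually holds a count. As an alternative, `agg` also accepts keyword arguments that name each result column. The sketch below uses a tiny made-up table (`toy`) with the same column names as the Titanic data; the values and the result names `avg_fare` and `passengers` are invented for illustration.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the Titanic data
toy = pd.DataFrame({'pclass': [1, 1, 2, 3, 3, 3],
                    'fare': [100.0, 80.0, 20.0, 10.0, 8.0, 12.0]})

summary = toy.groupby('pclass').agg(
    avg_fare=('fare', 'mean'),      # result name = (source column, function)
    passengers=('fare', 'count'),   # number of non-missing fares per class
)
print(summary)
```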


**For each of the age groups: 0-6, 7-12, 13-18, 18-60, 61-100 calculate how many passengers were in this age group and what fraction of the passengers in this age group survived.**

In [35]:
def age_group(age):
    age = float(age)
    if age < 7:
        return "0-6"
    if age < 13:
        return "7-12"
    if age < 18:
        return "13-18"
    if age <= 60:
        return "18-60"
    return "61-100"

titanic_df['ag'] = titanic_df.age.apply(age_group) 
titanic_df.groupby('ag').agg({
    'survived': 'mean',
    'ag': 'size'
})

ag,survived,ag
0-6,0.347692,325
13-18,0.45,60
18-60,0.393481,859
61-100,0.242424,33
7-12,0.4375,32


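The `apply` method is similar to `map` for lists: it calls `age_group` on every value of the `age` column. The results are stored in the new column `ag`: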

In [34]:
titanic_df

,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,ag
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,18-60
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,0-6
2,0,1,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,0-6
3,0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,18-60
4,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,18-60
5,1,1,"Anderson, Mr. Harry",male,48.0000,0,0,19952,26.5500,18-60
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1,0,13502,77.9583,61-100
7,0,1,"Andrews, Mr. Thomas Jr",male,39.0000,0,0,112050,0.0000,18-60
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2,0,11769,51.4792,18-60
9,0,1,"Artagaveytia, Mr. Ramon",male,71.0000,0,0,PC 17609,49.5042,61-100
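
Binning values into ranges can also be done without writing a function, using `pd.cut`. The sketch below applies it to a few hand-picked ages; the bin edges are an approximation of the `age_group` function above, not an exact copy: ages strictly between 60 and 61 land in `18-60` here, while `age_group` puts them in `61-100`.

```python
import pandas as pd

# A few hand-picked ages for illustration
ages = pd.Series([0.9167, 2.0, 12.0, 17.0, 18.0, 60.0, 71.0])

# right=False makes the bins left-closed: [0, 7), [7, 13), [13, 18), [18, 61), [61, 101)
bins = [0, 7, 13, 18, 61, 101]
labels = ['0-6', '7-12', '13-18', '18-60', '61-100']
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.tolist())
print(groups.cat.categories.tolist())   # categories keep the order of the bins
```

The result is a categorical column whose categories are ordered like the bins, so the group `7-12` is not sorted after `61-100` as it is with plain strings.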


**For each combination of sex and pclass calculate the chance of survival.**

In [38]:
titanic_df.groupby(['sex', 'pclass']).mean(numeric_only=True)

sex,pclass,survived,age,sibsp,parch,fare
female,1,0.965278,34.208333,0.555556,0.472222,109.412385
female,2,0.886792,26.720912,0.5,0.650943,23.234827
female,3,0.490741,15.611883,0.791667,0.731481,15.32425
male,1,0.340782,34.611266,0.340782,0.27933,69.888385
male,2,0.146199,28.472709,0.327485,0.192982,19.904946
male,3,0.15213,18.378972,0.470588,0.255578,12.406294
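
When only the survival rate is of interest, it can help to select that column and arrange the result as a small cross table. The following sketch does this with `unstack` on a made-up table (`toy`) that borrows the column names of `titanic_df`; the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as titanic_df
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass':   [1, 3, 1, 3, 3, 3],
    'survived': [1, 0, 0, 0, 1, 1],
})

# Mean of the 0/1 column `survived` = fraction of survivors in each group
rates = toy.groupby(['sex', 'pclass'])['survived'].mean()

# Move the pclass level of the index into the columns: one row per sex
table = rates.unstack('pclass')
print(table)
```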


**Add a column `is_child` to the dataframe and calculate the chance of survival for each combination of sex, pclass, and is_child:**

In [41]:
titanic_df['is_child'] = titanic_df['age'] < 18
titanic_df.groupby(['pclass', 'sex', 'is_child']).mean(numeric_only=True)

pclass,sex,is_child,survived,age,sibsp,parch,fare
1,female,False,0.968,38.504,0.56,0.48,112.503368
1,female,True,0.947368,5.947368,0.526316,0.421053,89.076968
1,male,False,0.326389,42.545139,0.395833,0.263889,72.147802
1,male,True,0.4,1.969049,0.114286,0.342857,60.592497
2,female,False,0.870588,31.570588,0.482353,0.564706,22.265441
2,female,True,0.952381,7.091271,0.571429,1.0,27.158533
2,male,False,0.083916,33.395105,0.328671,0.132867,19.799795
2,male,True,0.464286,3.333332,0.321429,0.5,20.441964
3,female,False,0.443396,28.160377,0.424528,0.688679,13.081174
3,female,True,0.536364,3.519697,1.145455,0.772727,17.485759


In [58]:
surv = titanic_df.groupby(['pclass', 'sex', 'is_child']).mean(numeric_only=True)
surv

pclass,sex,is_child,survived,age,sibsp,parch,fare
1,female,False,0.968,38.504,0.56,0.48,112.503368
1,female,True,0.947368,5.947368,0.526316,0.421053,89.076968
1,male,False,0.326389,42.545139,0.395833,0.263889,72.147802
1,male,True,0.4,1.969049,0.114286,0.342857,60.592497
2,female,False,0.870588,31.570588,0.482353,0.564706,22.265441
2,female,True,0.952381,7.091271,0.571429,1.0,27.158533
2,male,False,0.083916,33.395105,0.328671,0.132867,19.799795
2,male,True,0.464286,3.333332,0.321429,0.5,20.441964
3,female,False,0.443396,28.160377,0.424528,0.688679,13.081174
3,female,True,0.536364,3.519697,1.145455,0.772727,17.485759


### Another example

Let's analyze some sales data:

In [59]:
sales = pd.read_csv("sample-sales-20.csv", parse_dates=['date'])
sales.head()

,account number,name,sku,category,quantity,unit price,ext price,date
0,296809,Carroll PLC,QN-82852,Belt,13,44.48,578.24,2014-09-27 07:13:03
1,98022,Heidenreich-Bosco,MJ-21460,Shoes,19,53.62,1018.78,2014-07-29 02:10:44
2,563905,"Kerluke, Reilly and Bechtelar",AS-93055,Shirt,12,24.16,289.92,2014-03-01 10:51:24
3,93356,Waters-Walker,AS-93055,Shirt,5,82.68,413.4,2013-11-17 20:41:11
4,659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27


Using the `describe` method we can do some simple analysis:

In [60]:
sales.describe()

,account number,quantity,unit price,ext price
count,1000.0,1000.0,1000.0,1000.0
mean,535208.897,10.328,56.17963,579.8439
std,277589.746014,5.687597,25.331939,435.30381
min,93356.0,1.0,10.06,10.38
25%,299771.0,5.75,35.995,232.605
50%,563905.0,10.0,56.765,471.72
75%,750461.0,15.0,76.8025,878.1375
max,995267.0,20.0,99.97,1994.8


We can use index notation to get the values of a single column:

In [61]:
sales['unit price'].describe()

count    1000.000000
mean       56.179630
std        25.331939
min        10.060000
25%        35.995000
50%        56.765000
75%        76.802500
max        99.970000
Name: unit price, dtype: float64

The `dtypes` attribute lists the data type of each column in a dataframe:

In [62]:
sales.dtypes

account number             int64
name                      object
sku                       object
category                  object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

Let us remove some columns to simplify our analysis:

In [63]:
customers = sales[['name','ext price','date']]
customers.head()

,name,ext price,date
0,Carroll PLC,578.24,2014-09-27 07:13:03
1,Heidenreich-Bosco,1018.78,2014-07-29 02:10:44
2,"Kerluke, Reilly and Bechtelar",289.92,2014-03-01 10:51:24
3,Waters-Walker,413.4,2013-11-17 20:41:11
4,Waelchi-Fahey,1793.52,2014-01-03 08:14:27


Each row in the dataframe still represents a single order. Let us group the orders by customer and see how many orders each customer has placed:

In [64]:
customer_group = customers.groupby('name')
customer_group.size()

name
Berge LLC                        52
Carroll PLC                      57
Cole-Eichmann                    51
Davis, Kshlerin and Reilly       41
Ernser, Cruickshank and Lind     47
Gorczany-Hahn                    42
Hamill-Hackett                   44
Hegmann and Sons                 58
Heidenreich-Bosco                40
Huel-Haag                        43
Kerluke, Reilly and Bechtelar    52
Kihn, McClure and Denesik        58
Kilback-Gerlach                  45
Koelpin PLC                      53
Kunze Inc                        54
Kuphal, Zieme and Kub            52
Senger, Upton and Breitenberg    59
Volkman, Goyette and Lemke       48
Waelchi-Fahey                    54
Waters-Walker                    50
dtype: int64

Let us now add up the order values of each customer using `sum()`. We then sort the totals and show the first five, which are the customers with the lowest totals:

In [65]:
# numeric_only=True leaves out the `date` column, which cannot be summed
sales_totals = customer_group.sum(numeric_only=True)
sales_totals.sort_values(by='ext price').head()

name,ext price
"Davis, Kshlerin and Reilly",19054.76
Huel-Haag,21087.88
Gorczany-Hahn,22207.9
Hamill-Hackett,23433.78
Heidenreich-Bosco,25428.29


We might be more interested in who our best customers are:

In [66]:
sales_totals.sort_values(by='ext price', ascending=False).head()

name,ext price
"Kihn, McClure and Denesik",38935.29
Waters-Walker,36778.96
Carroll PLC,35934.31
Hegmann and Sons,35213.72
Kunze Inc,34406.54
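
Sorting followed by `head` can be shortened with `nlargest`. The sketch below shows this on a small made-up order table (`orders`); the customer names and prices are invented, only the column names follow the sales data.

```python
import pandas as pd

# Made-up orders with the same column names as the sales data
orders = pd.DataFrame({
    'name': ['A Corp', 'B Ltd', 'A Corp', 'C Inc', 'B Ltd', 'A Corp'],
    'ext price': [100.0, 250.0, 50.0, 75.0, 25.0, 10.0],
})

# Selecting the column first gives a Series of totals, one per customer
totals = orders.groupby('name')['ext price'].sum()

# The two customers with the highest totals, largest first
best = totals.nlargest(2)
print(best)
```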


## Another Example

### Preparation: making some random data

We use the `random` package to generate some random data:

In [67]:
import random

names = ['Fred', 'Peter', 'Bill', 
        'Sammy', 'Tom', 'Sarah', 'Anthony', 'Barny', 'Philip', 'Zach',
        'Betty', 'Chris', 'Sarah', 'Tim', 'Dick', 'Donald', 'Angela']
genders = ['male', 'male', 'male', 
           'male', 'male', 'female', 'male', 'male', 'male', 'male',
          'female', 'male', 'female', 'male', 'male', 'male' ,'female']
majors = ['software', 'management', 'finance']

random.seed(42)  # fixed seed so that every run produces the same values

header = "name,gender,major,year,q1,q2,q3,q4,q5,q6,hw1,hw2,hw3,exam\n"

with open('grades_new.csv', 'w', encoding='utf-8') as file:
    file.write(header)
    for name, gender in zip(names, genders):
        data = [name, gender, random.choice(majors), str(random.randint(2014, 2017))]
        grades = [str(random.randint(25, 100)) for i in range(10)]
        s = ','.join(data + grades) + "\n"
        file.write(s)

### Opening the data using pandas:

In [68]:
df = pd.read_csv('grades_new.csv')
df

,name,gender,major,year,q1,q2,q3,q4,q5,q6,hw1,hw2,hw3,exam
0,Fred,male,finance,2014,28,60,56,53,42,38,94,36,100,79
1,Peter,male,software,2014,36,52,54,89,28,96,50,94,78,53
2,Bill,male,management,2016,25,45,79,68,60,44,52,68,38,36
3,Sammy,male,management,2014,70,69,58,30,83,93,40,73,35,95
4,Tom,male,management,2016,98,49,33,30,54,62,35,54,37,73
5,Sarah,female,management,2017,71,45,72,70,51,59,34,46,93,56
6,Anthony,male,software,2017,73,59,96,53,66,32,54,29,65,76
7,Barny,male,management,2014,52,97,65,52,88,75,83,43,58,42
8,Philip,male,software,2016,99,79,99,76,71,53,42,90,88,36
9,Zach,male,software,2014,44,45,79,33,74,73,84,92,57,95


Now add columns that contain the average quiz and homework grades:

In [71]:
# calculate the average quiz grade
df['avg_q'] = (df['q1'] + df['q2'] + df['q3'] + df['q4'] + df['q5'] + df['q6']) / 6
df.head()

,name,gender,major,year,q1,q2,q3,q4,q5,q6,hw1,hw2,hw3,exam,avg_q
0,Fred,male,finance,2014,28,60,56,53,42,38,94,36,100,79,46.166667
1,Peter,male,software,2014,36,52,54,89,28,96,50,94,78,53,59.166667
2,Bill,male,management,2016,25,45,79,68,60,44,52,68,38,36,53.5
3,Sammy,male,management,2014,70,69,58,30,83,93,40,73,35,95,67.166667
4,Tom,male,management,2016,98,49,33,30,54,62,35,54,37,73,54.333333


In [79]:
df['avg_q'] = df.loc[:, 'q1':'q6'].mean(axis=1)
df['avg_hw'] = df.loc[:, 'hw1':'hw3'].mean(axis=1)
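
The two lines above rely on label slicing with `.loc` and on `mean(axis=1)`. The following sketch shows both on a small made-up table (`grades`) with invented values: a label slice such as `'q1':'q3'` includes both end columns, and `axis=1` averages across the columns of each row.

```python
import pandas as pd

# Made-up grades for two students
grades = pd.DataFrame({'name': ['x', 'y'],
                       'q1': [60, 80], 'q2': [70, 90], 'q3': [80, 100],
                       'exam': [50, 75]})

# All rows (:) and every column from 'q1' up to and including 'q3'
quiz_cols = grades.loc[:, 'q1':'q3']
print(quiz_cols.columns.tolist())

# axis=1 computes one mean per row instead of one mean per column
grades['avg_q'] = quiz_cols.mean(axis=1)
print(grades)
```

Compared with adding the columns by hand, this version keeps working when more quiz columns are inserted between the two end labels.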

In [83]:
df['final'] =  df[['avg_q', 'avg_hw', 'exam']].mean(axis=1)

In [86]:
df['passed'] = df['final'] >= 60
df.groupby(['major', 'passed']).size()

major       passed
finance     True      2
management  False     5
            True      1
software    False     3
            True      6
dtype: int64

In [87]:
df

,name,gender,major,year,q1,q2,q3,q4,q5,q6,hw1,hw2,hw3,exam,avg_q,avg_hw,final,passed
0,Fred,male,finance,2014,28,60,56,53,42,38,94,36,100,79,46.166667,76.666667,67.277778,True
1,Peter,male,software,2014,36,52,54,89,28,96,50,94,78,53,59.166667,74.0,62.055556,True
2,Bill,male,management,2016,25,45,79,68,60,44,52,68,38,36,53.5,52.666667,47.388889,False
3,Sammy,male,management,2014,70,69,58,30,83,93,40,73,35,95,67.166667,49.333333,70.5,True
4,Tom,male,management,2016,98,49,33,30,54,62,35,54,37,73,54.333333,42.0,56.444444,False
5,Sarah,female,management,2017,71,45,72,70,51,59,34,46,93,56,61.333333,57.666667,58.333333,False
6,Anthony,male,software,2017,73,59,96,53,66,32,54,29,65,76,63.166667,49.333333,62.833333,True
7,Barny,male,management,2014,52,97,65,52,88,75,83,43,58,42,71.5,61.333333,58.277778,False
8,Philip,male,software,2016,99,79,99,76,71,53,42,90,88,36,79.5,73.333333,62.944444,True
9,Zach,male,software,2014,44,45,79,33,74,73,84,92,57,95,58.0,77.666667,76.888889,True


In [89]:
finals = df[['name', 'gender', 'major', 'year', 'final', 'passed']]

In [95]:
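# Writing the legacy .xls format is no longer supported in pandas 2.0 and later;
# .xlsx needs the openpyxl package. index=False keeps the row index out of the file.
finals.to_excel('finals.xlsx', index=False)

In [96]:
df2 = pd.read_excel('finals.xlsx')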

In [98]:
for x in df2:
    print(x)

name
gender
major
year
final
passed
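
As the loop above shows, iterating over a DataFrame yields its column names, not its rows. To visit the rows, one option is `itertuples`, sketched here on a small made-up table; the values are invented for illustration.

```python
import pandas as pd

# Made-up final grades for two students
finals = pd.DataFrame({'name': ['Fred', 'Bill'],
                       'final': [67.3, 47.4],
                       'passed': [True, False]})

# Iterating over the DataFrame itself gives the column labels
print(list(finals))

# itertuples yields one named tuple per row; index=False leaves out the row index
for row in finals.itertuples(index=False):
    print(row.name, row.final, row.passed)
```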
