# Reading and Manipulating Data with pandas

### **Task 1**


1. This question doesn’t involve any programming. Suppose you have a pandas DataFrame. In a markdown cell, describe what each of the following operators/methods allows you to subset from a dataframe (that is, describe all of the types of subsetting you can do with the operator):

1) [] : Used for selecting columns, selecting rows by slice, or filtering rows with a condition.  
* **Indexing Columns**: `my_df['column1']` or `my_df[['column1', 'column2']]`; slicing inside `[]` applies to rows, so it cannot be used to slice columns
* **Indexing Rows**: `my_df[start:end]` or a Boolean array, as in `my_df[my_df['Name'] == 'Alice']`  

2) .iloc[] : Selects rows, columns, or both at once by integer position
* **Indexing Rows**: Can access a row by its integer location `my_df.iloc[0]`, and can return more than one row by passing a list `my_df.iloc[[0,1]]`, a slice `my_df.iloc[2:5]` (the end position is excluded), or a range `my_df.iloc[range(0,3)]`
* **Returning Rows and Columns**: `my_df.iloc[[0,1], [0, 2]]` or `my_df.iloc[3:6, 0:2]`

3) .loc[] : Similar to .iloc[], but it subsets by label (or by a Boolean array) instead of by integer position  
* **Indexing Rows**: `my_df.loc[0]` or `my_df2.loc["b"]` or `my_df.loc[2:5]` (the last value is included) or `my_df2.loc["b":"e"]` (the last value is included)
* **Rows and Columns**: `my_df.loc[:3, ['Name', "Major"]]` or `my_df.loc[:3, 'Name':"Major"]`
* **Using a Boolean**: `my_df.loc[my_df['Age'] > 21]`
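
The three operators described above can be compared side by side on a small frame. This is a minimal sketch: `my_df`, its column names, and its letter index are made-up illustration data, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data; names and labels are for illustration only.
my_df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Cara", "Dev"],
     "Age": [22, 19, 25, 21],
     "Major": ["Stats", "Math", "CS", "Stats"]},
    index=["a", "b", "c", "d"],
)

# [] : a list of names selects columns, a slice selects rows by position,
# and a Boolean Series keeps the rows where it is True.
cols = my_df[["Name", "Major"]]
first_two = my_df[0:2]
alice = my_df[my_df["Name"] == "Alice"]

# .iloc[] : integer positions; the end of a slice is excluded.
by_position = my_df.iloc[1:3, [0, 2]]

# .loc[] : labels; both ends of a label slice are included.
by_label = my_df.loc["b":"c", "Name":"Major"]

# .loc[] also accepts a Boolean Series for rows plus column labels.
over_21 = my_df.loc[my_df["Age"] > 21, ["Name", "Age"]]

print(by_position)
print(by_label)
```

Using a non-numeric index here makes the difference between position-based and label-based selection easier to see.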

### **Task 2**

 2. Read in the BreastCancer.dat data file available in the assignment link. (Open the file in a program such as Notepad or WordPad to determine the delimiter, although a program like Notepad++ is a better choice.) Upload this file to your Colab notebook and read it in using a relative file path (just the file name).

In [None]:
import pandas as pd
import os
os.getcwd()

'/content'

In [None]:
# a. Save the data as an object called cancer_data.
cancer_data = pd.read_csv('BreastCancer.dat', sep="\t")

# b. Use the .head() method to look at the data.
cancer_data.head()

,id,age,meno,size,grade,gradd1,gradd2,nodes,enodes,pgr,er,hormon,rectime,censrec,_st,_d,_t,_t0
0,132,49,premenopausal,18,2,1,0,2,0.786628,0,0,no tamoxifen,1838,0,1,0,5.03217,0
1,1575,55,Postmenopausal,20,3,1,1,16,0.146607,0,0,no tamoxifen,403,1,1,1,1.103354,0
2,1140,56,Postmenopausal,40,3,1,1,3,0.697676,0,0,no tamoxifen,1603,0,1,0,4.388775,0
3,769,45,premenopausal,25,3,1,1,1,0.886921,0,4,no tamoxifen,177,0,1,0,0.4846,0
4,130,65,Postmenopausal,30,2,1,0,5,0.548812,0,36,had tamoxifen,1855,0,1,0,5.078713,0
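
If opening the file in a text editor is inconvenient, the delimiter can also be checked from Python by printing the raw first line with `repr()`, which shows a tab as `\t`. The sketch below uses a small in-memory string as a stand-in for the file; the three columns shown are an abbreviated, assumed sample rather than the full BreastCancer.dat layout.

```python
import io
import pandas as pd

# Stand-in for the file contents (assumption: tab-separated, header in row one).
raw = "id\tage\tmeno\n132\t49\tpremenopausal\n1575\t55\tPostmenopausal\n"

# repr() makes invisible delimiters such as tabs visible.
first_line = raw.splitlines()[0]
print(repr(first_line))

# Once the delimiter is known, pass it to read_csv through sep.
toy = pd.read_csv(io.StringIO(raw), sep="\t")
print(toy)
```

With a real file, the same check could be done by opening it with `open()` and reading one line before calling `pd.read_csv`.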


In [None]:
# c. Return just the grade column using the column attribute
cancer_data.grade

,grade
0,2
1,3
2,3
3,3
4,2
...,...
681,3
682,2
683,3
684,2


In [None]:
#  d. Use the .loc[] method to print out all rows where the size is larger than 30.
cancer_data.loc[cancer_data['size'] > 30]

,id,age,meno,size,grade,gradd1,gradd2,nodes,enodes,pgr,er,hormon,rectime,censrec,_st,_d,_t,_t0
2,1140,56,Postmenopausal,40,3,1,1,3,0.697676,0,0,no tamoxifen,1603,0,1,0,4.388775,0
5,1642,48,premenopausal,52,2,1,0,11,0.267135,0,0,no tamoxifen,842,1,1,1,2.305270,0
14,820,32,premenopausal,57,3,1,1,24,0.056135,0,13,no tamoxifen,448,1,1,1,1.226557,0
19,1185,58,Postmenopausal,45,3,1,1,4,0.618783,0,0,no tamoxifen,370,1,1,1,1.013005,0
30,777,49,premenopausal,55,2,1,0,7,0.431710,0,0,no tamoxifen,679,1,1,1,1.859001,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
672,157,59,Postmenopausal,45,2,1,0,6,0.486752,739,526,no tamoxifen,740,0,1,0,2.026010,0
673,1137,47,premenopausal,70,2,1,0,5,0.548812,796,24,no tamoxifen,1703,0,1,0,4.662560,0
674,1171,48,Postmenopausal,35,1,0,0,2,0.786628,845,105,had tamoxifen,1578,0,1,0,4.320328,0
675,1178,64,Postmenopausal,35,1,0,0,4,0.618783,858,15,no tamoxifen,722,0,1,0,1.976728,0


In [None]:
# e. Use the .loc[] method to print out all rows where the size is greater than 30 and the grade is 3.
cancer_data.loc[(cancer_data['size'] > 30) & (cancer_data['grade'] == 3)]

,id,age,meno,size,grade,gradd1,gradd2,nodes,enodes,pgr,er,hormon,rectime,censrec,_st,_d,_t,_t0
2,1140,56,Postmenopausal,40,3,1,1,3,0.697676,0,0,no tamoxifen,1603,0,1,0,4.388775,0
14,820,32,premenopausal,57,3,1,1,24,0.056135,0,13,no tamoxifen,448,1,1,1,1.226557,0
19,1185,58,Postmenopausal,45,3,1,1,4,0.618783,0,0,no tamoxifen,370,1,1,1,1.013005,0
32,588,56,Postmenopausal,40,3,1,1,1,0.886921,0,3,no tamoxifen,1013,0,1,0,2.773443,0
37,735,48,premenopausal,45,3,1,1,1,0.886921,0,0,no tamoxifen,1862,0,1,0,5.097878,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,1239,37,premenopausal,35,3,1,1,1,0.886921,242,67,no tamoxifen,205,1,1,1,0.561259,0
637,884,53,premenopausal,75,3,1,1,19,0.102284,375,107,no tamoxifen,600,1,1,1,1.642710,0
652,1488,66,Postmenopausal,42,3,1,1,11,0.267135,412,339,had tamoxifen,394,1,1,1,1.078713,0
664,346,63,Postmenopausal,45,3,1,1,2,0.786628,530,328,no tamoxifen,2148,0,1,0,5.880903,0
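
Parts d and e rely on Boolean masks. A small sketch on made-up data shows how masks combine: `&` means "and", `|` means "or", `~` negates, and each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data reusing two column names from the cancer data.
toy = pd.DataFrame({"size": [18, 40, 57, 35], "grade": [2, 3, 3, 1]})

big = toy["size"] > 30       # Boolean Series, one value per row
grade3 = toy["grade"] == 3

both = toy.loc[big & grade3]     # rows meeting both conditions
either = toy.loc[big | grade3]   # rows meeting at least one condition
not_big = toy.loc[~big]          # rows where the first condition fails

# Attribute access does not work for this column: DataFrame.size is the
# built-in element count, so toy.size is an integer, not the column.
print(toy.size)
print(both)
```

This is also why the cells above use `cancer_data['size']` rather than `cancer_data.size`.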


In [None]:
# f. Use [] to return just the age, size, and grade columns.
cancer_data[['age', 'size', 'grade']]

,age,size,grade
0,49,18,2
1,55,20,3
2,56,40,3
3,45,25,3
4,65,30,2
...,...,...,...
681,51,30,3
682,64,26,2
683,57,35,3
684,44,21,2


In [None]:
# g. Use .loc[] to return the rows where meno is equal to premenopausal along with the age, size, and grade columns.
cancer_data.loc[cancer_data['meno'] == 'premenopausal', ['age', 'size', 'grade']]

,age,size,grade
0,49,18,2
3,45,25,3
5,48,52,2
6,48,21,3
7,37,20,2
...,...,...,...
669,38,61,2
673,47,70,2
679,43,20,2
681,51,30,3


### **Task 3**

3. There are two files about mosquitos available at:
* https://www4.stat.ncsu.edu/~online/datasets/mosquito.txt
* https://www4.stat.ncsu.edu/~online/datasets/mosquito2.txt

In [None]:
# a. Determine the delimiter and read in the mosquito.txt file as an object called mosq_data.
url1 = "https://www4.stat.ncsu.edu/~online/datasets/mosquito.txt"
mosq_data = pd.read_csv(url1, sep="&")

#  b. Similarly, read in the mosquito2.txt file. Note this file doesn’t contain column names! The columns
# are the same as the other file though. Use an attribute from mosq_data to assign the column
# names as you read in the data. Save this data as an object called mosq_data2.
url2 = "https://www4.stat.ncsu.edu/~online/datasets/mosquito2.txt"
mosq_data2 = pd.read_csv(url2,
                         sep="\t",
                         header=None,
                         names=mosq_data.columns)  # reuse the column names from mosq_data

print(mosq_data.head())
print(mosq_data2.tail())

   Day  Cage trt  Response
0    1     1  T2        42
1    2     1  T1        44
2    2     2  T2        42
3    3     1  T1        25
4    3     2  T3        33
    Day  Cage trt  Response
25   28     2  T3        39
26   29     1  T2        33
27   29     2  T3        39
28   30     1  T1        42
29   30     2  T2        42
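
Part b asks for the column names to come from an attribute of the first frame. A minimal sketch of that pattern follows, using in-memory strings as stand-ins for the two files; the delimiters and values are assumptions chosen to mirror the cells above.

```python
import io
import pandas as pd

# Stand-in for a file that has a header row (assumed "&"-delimited).
with_header = pd.read_csv(
    io.StringIO("Day&Cage&trt&Response\n1&1&T2&42\n2&1&T1&44\n"),
    sep="&",
)

# Stand-in for a file with no header row (assumed tab-delimited).
# header=None stops pandas from treating the first data row as names,
# and names= takes the labels from the .columns attribute of the first frame.
no_header = pd.read_csv(
    io.StringIO("28\t2\tT3\t39\n29\t1\tT2\t33\n"),
    sep="\t",
    header=None,
    names=with_header.columns,
)

print(no_header)
```

Taking the names from `.columns` keeps the two frames aligned even if the first file's header changes later.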


In [None]:
# c. Combine the two datasets into one data frame using the concat() function from pandas (see
# https://pandas.pydata.org/docs/reference/api/pandas.concat.html).
combine_mosq = pd.concat([mosq_data, mosq_data2], ignore_index=True)
combine_mosq.tail()

,Day,Cage,trt,Response
62,28,2,T3,39
63,29,1,T2,33
64,29,2,T3,39
65,30,1,T1,42
66,30,2,T2,42
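
The effect of `ignore_index=True` is easiest to see next to the default behavior. The sketch below uses two small made-up frames (`a` and `b` are hypothetical names).

```python
import pandas as pd

a = pd.DataFrame({"Day": [1, 2], "Response": [42, 44]})
b = pd.DataFrame({"Day": [28, 29], "Response": [39, 33]})

# Default: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Repeated labels can make later `.loc[]` lookups return more than one row, which is one reason to renumber after stacking frames.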


# Summarizing Data Numerically

For this part, we’ll use the StudentData.txt data that comes from the UCI machine learning repository. Information about the variables in the dataset can be found here. I want you to look at the math scores dataset.

You should read up on the variables. The dataset is generally about math scores (G1, G2, G3) for students from two different schools. They also measure a bunch of things about the students’ home life. Hopefully you can make some interesting connections!

### Task 1: Read in the data
* You can either read this dataset from the URL or download it and read it in locally. Check out the first few observations of the data.

In [None]:
url3 = "https://www4.stat.ncsu.edu/~online/datasets/StudentData.txt"
student_data = pd.read_csv(url3, sep=";")
student_data.head()

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


### Task 2: Summarize the Data
This data has many categorical variables and a few numeric. You should do the following:

**Categorical variables**
* Create a one-way contingency table, a two-way contingency table, and a three-way contingency table  
– Interpret a number from each resulting table (that is, pick out a value produced and explain what that value means.)
* Create a conditional two-way table. That is, condition on one variable’s setting and create a two-way table. Do this using two different methods:  
– Once, by subsetting the data (say with .loc) and then creating the two-way table  
– Once, by creating a three-way table and subsetting it

In [None]:
student_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [None]:
# one-way contingency table
student_data.sex.value_counts()

sex,count
F,208
M,187


* 208: There are 208 female students.

In [None]:
# two-way contingency table
pd.crosstab(
    student_data.sex,
    student_data.school)

school,GP,MS
sex,,
F,183,25
M,166,21


* 183: 183 female students attend GP school.

In [None]:
# three-way contingency table
pd.crosstab(
    [student_data.sex, student_data.school],
    student_data.guardian,
    margins = True)

,guardian,father,mother,other,All
sex,school,,,,
F,GP,37,128,18,183
F,MS,6,17,2,25
M,GP,41,117,8,166
M,MS,6,11,4,21
All,,90,273,32,395


* 37: 37 female students who attend GP school list their father as their guardian.

In [None]:
# Create a conditional two-way table.
# Once, by subsetting the data (say with .loc) and then creating the two-way table
subset = student_data.loc[student_data["sex"] == "M"]

pd.crosstab(
    subset.school,
    subset.guardian,
    margins = True)

guardian,father,mother,other,All
school,,,,
GP,41,117,8,166
MS,6,11,4,21
All,47,128,12,187


In [None]:
# Once, by creating a three-way table and subsetting it
three_way = pd.crosstab(
    [student_data.sex, student_data.school],
    student_data.guardian,
    margins = True)

three_way.loc[('F', ("GP", "MS")), :]

,guardian,father,mother,other,All
sex,school,,,,
F,GP,37,128,18,183
F,MS,6,17,2,25
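
Both routes to a conditional two-way table can be checked against each other on a tiny made-up frame. In this sketch, `toy` and its values are hypothetical, and margins are left off so that the `All` row does not get mixed into the subsetting.

```python
import pandas as pd

# Hypothetical data with the same three variable names as above.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "school": ["GP", "GP", "MS", "GP", "MS", "MS"],
    "guardian": ["mother", "father", "mother", "mother", "father", "mother"],
})

# Method 1: filter the rows first, then build a two-way table.
females = toy.loc[toy["sex"] == "F"]
direct = pd.crosstab(females["school"], females["guardian"])

# Method 2: build the three-way table, then pull out one level of the
# outer row index. .loc["F"] and .xs("F", level="sex") give the same slice.
three_way = pd.crosstab([toy["sex"], toy["school"]], toy["guardian"])
female_table = three_way.loc["F"]
female_xs = three_way.xs("F", level="sex")

print(direct)
print(female_table)
```

Selecting with a single outer label drops the `sex` level from the result, so the output has the same shape as the table built from the filtered rows.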


**Numeric variables (and across groups)**  
The numeric variables are age, absences, and the three test grades variables (G1, G2, and G3).
* Find measures of center and spread for three of these variables (including G3 as one of them)  
– Repeat while subsetting the data by some grouping variable (say with .loc)
* Find measures of center and spread across a single grouping variable for three of these variables (including G3 as one of them)
* Find measures of center and spread across two grouping variables for three of these variables (including G3 as one of them)
* Create a correlation matrix between all of the numeric variables

In [None]:
print(student_data.age.mean())
print(student_data.absences.mean())
print(student_data.G3.mean())
print(" ")
print(student_data.age.median())
print(student_data.absences.median())
print(student_data.G3.median())

16.696202531645568
5.708860759493671
10.415189873417722
 
17.0
4.0
11.0


In [None]:
student_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])

,age
0.2,15.0
0.25,16.0
0.5,17.0
0.95,19.0


In [None]:
student_data.absences.quantile(q = [0.2, 0.25, 0.5, 0.95])

,absences
0.2,0.0
0.25,0.0
0.5,4.0
0.95,18.3


In [None]:
student_data.G3.quantile(q = [0.2, 0.25, 0.5, 0.95])

,G3
0.2,8.0
0.25,8.0
0.5,11.0
0.95,17.0
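
Quantiles describe position; to turn them into a measure of spread, one option is the interquartile range (the 0.75 quantile minus the 0.25 quantile), alongside the standard deviation and the range. The sketch below uses a made-up Series named `scores` rather than the student data.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([0, 5, 8, 10, 11, 13, 15, 19])

# Requesting two quantiles returns a two-element Series that can be unpacked.
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1                         # interquartile range

spread = {
    "std": scores.std(),              # sample standard deviation (ddof=1)
    "iqr": iqr,
    "range": scores.max() - scores.min(),
}
print(spread)
```

The same calls would apply to columns such as `student_data.G3`, assuming the frame has been read in as above.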


In [None]:
# Repeat while subsetting the data by some grouping variable (say with .loc)
subset3 = student_data.loc[:, ['age','absences','G3']]

print(subset3.age.mean())
print(subset3.absences.mean())
print(subset3.G3.mean())
print(" ")
print(subset3.age.median())
print(subset3.absences.median())
print(subset3.G3.median())

16.696202531645568
5.708860759493671
10.415189873417722
 
17.0
4.0
11.0
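
The cell above selects columns but keeps every row, which is why its results match the full-data summaries. One way to condition on a group instead is to pass a Boolean mask as the row selector in `.loc[]` and then summarize. The sketch below uses a small made-up frame; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data reusing three column names from the student data.
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS"],
    "age": [15, 16, 17, 18, 18],
    "G3": [10, 12, 14, 8, 12],
})

# Keep only the rows for one school, and only the numeric columns of interest.
gp_only = toy.loc[toy["school"] == "GP", ["age", "G3"]]

# agg() with a list of function names returns one row per statistic,
# covering center (mean, median) and spread (std) in a single call.
summary = gp_only.agg(["mean", "median", "std"])
print(summary)
```

Applied to the real data, the mask could be something like `student_data["school"] == "GP"`, with the three chosen numeric columns in the column selector.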


In [None]:
# Find measures of center and spread across a single grouping variable for three of these variables
# (including G3 as one of them)
student_data.groupby('age')[['absences','G3']].mean()

age,absences,G3
15,3.341463,11.256098
16,5.451923,11.028846
17,6.459184,10.27551
18,6.04878,9.54878
19,10.458333,8.208333
20,5.0,14.0
21,3.0,7.0
22,16.0,8.0


In [None]:
student_data.groupby('age')[['absences','G3']].median()

age,absences,G3
15,2.0,11.0
16,4.0,11.0
17,4.0,11.0
18,4.0,10.0
19,6.5,9.0
20,4.0,15.0
21,3.0,7.0
22,16.0,8.0


In [None]:
student_data.groupby('age')[['absences','G3']].std()

age,absences,G3
15,4.074066,4.596861
16,7.307238,4.282402
17,8.49884,4.290437
18,9.395169,4.9942
19,11.394046,4.606037
20,5.567764,4.582576
21,,
22,,


In [None]:
student_data.age.value_counts()

age,count
16,104
17,98
18,82
15,82
19,24
20,3
22,1
21,1


In [None]:
# Find measures of center and spread across two grouping variables for three of these variables
# (including G3 as one of them)
student_data.groupby(['age', 'absences'])['G3'].mean()

age,absences,G3
15,0,10.214286
15,2,11.500000
15,4,14.000000
15,6,12.200000
15,8,9.714286
...,...,...
20,0,18.000000
20,4,15.000000
20,11,9.000000
21,3,7.000000


In [None]:
student_data.groupby(['age', 'absences'])['G3'].median()

age,absences,G3
15,0,11.0
15,2,11.0
15,4,14.0
15,6,13.0
15,8,10.0
...,...,...
20,0,18.0
20,4,15.0
20,11,9.0
21,3,7.0


In [None]:
student_data.groupby(['age', 'absences'])['G3'].std()

age,absences,G3
15,0,6.154535
15,2,3.391165
15,4,2.121320
15,6,3.552777
15,8,2.360387
...,...,...
20,0,
20,4,
20,11,
21,3,
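
The blank standard deviations above come from groups with a single observation: the sample standard deviation divides by n - 1, so it is undefined when n is 1. A sketch on made-up data shows this, and also shows how `agg()` can return several statistics at once for two categorical grouping variables. The choice of `school` and `sex` as grouping variables is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data; the (MS, M) group deliberately has a single row.
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "MS", "MS", "MS"],
    "sex": ["F", "F", "M", "M", "F", "F", "M"],
    "G3": [10, 14, 8, 12, 9, 11, 15],
})

# One row per (school, sex) combination, one column per statistic.
summary = toy.groupby(["school", "sex"])["G3"].agg(
    ["mean", "median", "std", "count"]
)
print(summary)
```

Including `count` next to `std` makes it easy to see which groups are too small for the spread to be meaningful.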


In [None]:
# Create a correlation matrix between all of the numeric variables
student_data.loc[:, ['age','absences','G1', 'G2', 'G3']].corr()

,age,absences,G1,G2,G3
age,1.0,0.17523,-0.064081,-0.143474,-0.161579
absences,0.17523,1.0,-0.031003,-0.031777,0.034247
G1,-0.064081,-0.031003,1.0,0.852118,0.801468
G2,-0.143474,-0.031777,0.852118,1.0,0.904868
G3,-0.161579,0.034247,0.801468,0.904868,1.0
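
Instead of listing the numeric columns by hand, they can be picked by dtype with `select_dtypes()` before calling `.corr()`. This is a sketch on made-up data. On the student data it would also pull in integer-coded variables such as `Medu` and `Fedu`, so listing columns explicitly, as in the cell above, may still be the better fit for this task.

```python
import pandas as pd

# Hypothetical data with one text column and three numeric columns.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "G1": [5, 7, 15, 6],
    "G2": [6, 8, 14, 10],
    "absences": [6, 4, 2, 4],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))
```

The result is symmetric with ones on the diagonal, so only the values above or below the diagonal need to be read.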
