Commit fee243b

Merge pull request #10 from gjbex/development
Development
2 parents 84b42b8 + 1501833 commit fee243b

4 files changed: +303 -39 lines changed

python_for_data_science.pptx

-22 Bytes
Binary file not shown.

source-code/pandas/README.md

Lines changed: 1 addition & 0 deletions
@@ -23,4 +23,5 @@ easy to use.
 1. `patient_data.ipynb`: extended version of the running example used
    in the Python slides.
 1. `bokeh_plot.ipynb`: using Bokeh as a plotting backend for pandas.
+1. `pipes.ipynb`: consolidating data processing using pipes.
 1. `screenshots`: screenshots made for the slides.

source-code/pandas/patient_data.ipynb

Lines changed: 40 additions & 39 deletions
Large diffs are not rendered by default.

source-code/pandas/pipes.ipynb

Lines changed: 262 additions & 0 deletions
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b8acc759-2d21-4ac1-a64a-e338fa7f516c",
+   "metadata": {},
+   "source": [
+    "# Requirements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "09782f37-83d3-4670-a6b4-5030c7a0717d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75061e26-7b02-4955-a623-f085bdd461e0",
+   "metadata": {},
+   "source": [
+    "# Data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10c99a56-bf6b-4135-8a89-44a8caa38d63",
+   "metadata": {},
+   "source": [
+    "Read the patient experiment data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "55bc641a-bd06-4a5e-8d58-c8e177e62d89",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = pd.read_excel('data/patient_experiment.xlsx')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "267a9df3-f049-4ff9-96c8-d4af5d1da64c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 62 entries, 0 to 61\n",
+      "Data columns (total 4 columns):\n",
+      " # Column Non-Null Count Dtype \n",
+      "--- ------ -------------- ----- \n",
+      " 0 patient 62 non-null int64 \n",
+      " 1 dose 61 non-null float64 \n",
+      " 2 date 62 non-null datetime64[ns]\n",
+      " 3 temperature 61 non-null float64 \n",
+      "dtypes: datetime64[ns](1), float64(2), int64(1)\n",
+      "memory usage: 2.1 KB\n"
+     ]
+    }
+   ],
+   "source": [
+    "data.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "54f7cce5-d811-4def-a99d-53c2c5893126",
+   "metadata": {},
+   "source": [
+    "The first step is transforming the data into a time series."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "81c6819c-412f-4787-8e44-cef39b035ca4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def create_time_series(df):\n",
+    "    return df.pivot_table(index='date', columns=['patient'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2a9d1c9-d8a8-4df4-ba2a-0faf2250a11c",
+   "metadata": {},
+   "source": [
+    "Next, we should deal with missing data by interpolation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "4c6e3606-cb4f-48a5-808f-1d4d65649ce1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def impute(df):\n",
+    "    return df.interpolate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e63b2af-4d79-49e6-9f93-bbd6966a60a6",
+   "metadata": {},
+   "source": [
+    "Finally, we compute the mean value of the temperatures across all patients for each time step. Note that the name of the column is a parameter."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "1702df3c-65b1-45dd-8280-c87e300ac9a6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def compute_mean(df, column):\n",
+    "    df['avg_temp'] = df[column].mean(axis=1)\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2fc4796e-fb31-4780-819e-c1680e3608cb",
+   "metadata": {},
+   "source": [
+    "All these operations can be chained using pipes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "6609a625-c8ed-4695-8bd4-74d9cd7af05b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "time_series = data.pipe(create_time_series) \\\n",
+    "    .pipe(impute) \\\n",
+    "    .pipe(compute_mean, 'temperature')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "3f6beb98-9d04-4d87-84f1-a7db0bbcd46c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "DatetimeIndex: 7 entries, 2012-10-02 10:00:00 to 2012-10-02 16:00:00\n",
+      "Data columns (total 19 columns):\n",
+      " # Column Non-Null Count Dtype \n",
+      "--- ------ -------------- ----- \n",
+      " 0 (dose, 1) 7 non-null float64\n",
+      " 1 (dose, 2) 7 non-null float64\n",
+      " 2 (dose, 3) 7 non-null float64\n",
+      " 3 (dose, 4) 7 non-null float64\n",
+      " 4 (dose, 5) 7 non-null float64\n",
+      " 5 (dose, 6) 7 non-null float64\n",
+      " 6 (dose, 7) 7 non-null float64\n",
+      " 7 (dose, 8) 7 non-null float64\n",
+      " 8 (dose, 9) 7 non-null float64\n",
+      " 9 (temperature, 1) 7 non-null float64\n",
+      " 10 (temperature, 2) 7 non-null float64\n",
+      " 11 (temperature, 3) 7 non-null float64\n",
+      " 12 (temperature, 4) 7 non-null float64\n",
+      " 13 (temperature, 5) 7 non-null float64\n",
+      " 14 (temperature, 6) 7 non-null float64\n",
+      " 15 (temperature, 7) 7 non-null float64\n",
+      " 16 (temperature, 8) 7 non-null float64\n",
+      " 17 (temperature, 9) 7 non-null float64\n",
+      " 18 (avg_temp, ) 7 non-null float64\n",
+      "dtypes: float64(19)\n",
+      "memory usage: 1.1 KB\n"
+     ]
+    }
+   ],
+   "source": [
+    "time_series.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0dced61d-d65a-4e5b-9698-545d742d384a",
+   "metadata": {},
+   "source": [
+    "The original dataframe is unchanged."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "42e86dc9-5dea-4472-8241-159200c55dac",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 62 entries, 0 to 61\n",
+      "Data columns (total 4 columns):\n",
+      " # Column Non-Null Count Dtype \n",
+      "--- ------ -------------- ----- \n",
+      " 0 patient 62 non-null int64 \n",
+      " 1 dose 61 non-null float64 \n",
+      " 2 date 62 non-null datetime64[ns]\n",
+      " 3 temperature 61 non-null float64 \n",
+      "dtypes: datetime64[ns](1), float64(2), int64(1)\n",
+      "memory usage: 2.1 KB\n"
+     ]
+    }
+   ],
+   "source": [
+    "data.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "77ff0a74-c286-412f-8ea9-6df113efd4fe",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
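
The pipeline added in `pipes.ipynb` can be tried without the Excel file. What follows is a minimal sketch, not the notebook's own code: the six-row dataset is made up as a stand-in for `data/patient_experiment.xlsx`, and `compute_mean` is deliberately changed to use `DataFrame.assign` instead of in-place column assignment, so that the function does not modify the frame it receives.

```python
import pandas as pd

# Hypothetical stand-in for data/patient_experiment.xlsx:
# two patients, three hourly readings, one missing temperature.
data = pd.DataFrame({
    'patient': [1, 2, 1, 2, 1, 2],
    'dose': [0.0, 0.0, 2.0, 1.0, 4.0, 2.0],
    'date': pd.to_datetime([
        '2012-10-02 10:00', '2012-10-02 10:00',
        '2012-10-02 11:00', '2012-10-02 11:00',
        '2012-10-02 12:00', '2012-10-02 12:00',
    ]),
    'temperature': [38.0, 39.0, None, 40.5, 37.0, 40.0],
})


def create_time_series(df):
    # One row per timestamp, one (measurement, patient) column pair each.
    return df.pivot_table(index='date', columns=['patient'])


def impute(df):
    # Fill missing values by linear interpolation along each column.
    return df.interpolate()


def compute_mean(df, column):
    # Variation on the notebook: assign() returns a new frame
    # rather than adding the column to the argument in place.
    return df.assign(avg_temp=df[column].mean(axis=1))


# Parentheses replace the backslash continuations used in the notebook.
time_series = (
    data.pipe(create_time_series)
        .pipe(impute)
        .pipe(compute_mean, 'temperature')
)

print(time_series)
```

Extra positional arguments to `pipe` (here `'temperature'`) are passed on to the function after the dataframe. Because `pivot_table` already returns a new frame, `data` keeps its missing value in either variant; the `assign` version additionally keeps `compute_mean` safe to call directly on a frame that should not change.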
