<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Intermediate Python</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Intermediate Python</h1></a>
<h2 class="subtitle">Analyzing Mosquito Data</h2>
<h2 id="introduction">Introduction</h2>
<p>This material assumes that you have programmed before. This first lecture provides a quick introduction to programming in Python for those who either haven’t used Python before or need a quick refresher.</p>
<p>Let’s start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:</p>
<pre><code>year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153</code></pre>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Conduct variable assignment, looping, and conditionals in Python</li>
<li>Use an external Python library</li>
<li>Read tabular data from a file</li>
<li>Subset and perform analysis on data</li>
<li>Display simple graphs</li>
</ul>
</div>
</section>
<h2 id="loading-data">Loading Data</h2>
<p>In order to load the data, we need to import a library called Pandas that knows how to operate on tables of data.</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> pandas</code></pre>
<p>We can now use Pandas to read our data file.</p>
<pre class="sourceCode python"><code class="sourceCode python">pandas.read_csv(<span class="st">'A1_mosquito_data.csv'</span>)</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
0  2001           80       157        150
1  2002           85       252        217
2  2003           86       154        153
3  2004           87       159        158
4  2005           74       292        243
5  2006           75       283        237
6  2007           80       214        190
7  2008           85       197        181
8  2009           74       231        200
9  2010           74       207        184</code></pre>
<p>The <code>read_csv()</code> function belongs to the <code>pandas</code> library. In order to run it we need to tell Python that it is part of <code>pandas</code> and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.</p>
<p>When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.</p>
<p>Our call to <code>pandas.read_csv()</code> read data into memory, but didn’t save it anywhere. To do that, we need to assign the resulting table to a variable. In Python we use <code>=</code> to assign a new value to a variable like this:</p>
<pre class="sourceCode python"><code class="sourceCode python">data = pandas.read_csv(<span class="st">'A1_mosquito_data.csv'</span>)</code></pre>
<p>This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
0  2001           80       157        150
1  2002           85       252        217
2  2003           86       154        153
3  2004           87       159        158
4  2005           74       292        243
5  2006           75       283        237
6  2007           80       214        190
7  2008           85       197        181
8  2009           74       231        200
9  2010           74       207        184
</code></pre>
<p><code>print data</code> tells Python to display the text. Alternatively we could just include <code>data</code> as the last value in a code cell:</p>
<pre class="sourceCode python"><code class="sourceCode python">data</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
0  2001           80       157        150
1  2002           85       252        217
2  2003           86       154        153
3  2004           87       159        158
4  2005           74       292        243
5  2006           75       283        237
6  2007           80       214        190
7  2008           85       197        181
8  2009           74       231        200
9  2010           74       207        184</code></pre>
<p>This tells the IPython Notebook to display the <code>data</code> object, which is why we see a nicely formatted table.</p>
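<p>If the CSV file is not at hand, the same steps can be tried on data held in a string. The sketch below is an illustration rather than part of the lesson's workflow: it assumes Python 3 syntax (<code>print()</code> as a function) instead of the Python 2.7 used here, copies the first three rows of the table above into a string, and passes them to <code>pandas.read_csv()</code> through <code>io.StringIO</code>. The names <code>csv_text</code> and <code>sample</code> are made up for this example.</p>

```python
import io

import pandas

# First three rows of the table above, stored as CSV text (a stand-in for the file).
csv_text = """year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
"""

# read_csv() accepts a file-like object as well as a file name.
sample = pandas.read_csv(io.StringIO(csv_text))

print(sample)                # explicit display of the whole table
print(list(sample.columns))  # column names taken from the header row
print(len(sample))           # number of data rows
```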
<h2 id="manipulating-data">Manipulating data</h2>
<p>Once we have imported the data we can start doing things with it. First, let’s ask what type of thing <code>data</code> refers to:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> <span class="dt">type</span>(data)</code></pre>
<pre class="output"><code><class 'pandas.core.frame.DataFrame'>
</code></pre>
<p>The data is stored in a data structure called a DataFrame. Other data structures are also commonly used in scientific computing, including NumPy arrays and NumPy matrices, which can be used for doing linear algebra.</p>
<p>We can select an individual column of data using its name:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="st">'year'</span>]</code></pre>
<pre class="output"><code>0 2001
1 2002
2 2003
3 2004
4 2005
5 2006
6 2007
7 2008
8 2009
9 2010
Name: year, dtype: int64
</code></pre>
<p>Or we can select several columns of data at once:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[[<span class="st">'rainfall'</span>, <span class="st">'temperature'</span>]]</code></pre>
<pre class="output"><code>   rainfall  temperature
0       157           80
1       252           85
2       154           86
3       159           87
4       292           74
5       283           75
6       214           80
7       197           85
8       231           74
9       207           74
</code></pre>
<p>We can also select subsets of rows using slicing. Say we just want the first two rows of data:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="dv">0</span>:<span class="dv">2</span>]</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
0  2001           80       157        150
1  2002           85       252        217
</code></pre>
<p>There are a couple of important things to note here. First, Python indexing starts at zero. In contrast, programming languages like R and MATLAB start counting at 1, because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do. This means that if we have 5 things in Python they are numbered 0, 1, 2, 3, 4, and the first row in a data frame is always row 0.</p>
<p>The other thing to note is that the subset of rows starts at the first value and goes up to, but does not include, the second value. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.</p>
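<p>The same slicing rules apply to ordinary Python lists, which makes them easy to check in isolation. A minimal sketch, using a plain list of years instead of the data frame; it is written with <code>print()</code> and a single argument so that it runs on Python 3 as well as Python 2.7, and the variable names are illustrative:</p>

```python
# Years from the first data file, held in a plain Python list.
years = [2001, 2002, 2003, 2004, 2005]

first_two = years[0:2]  # positions 0 and 1; position 2 is excluded
middle = years[1:4]     # positions 1, 2 and 3

print(first_two)
print(middle)

# The slice length equals the upper bound minus the lower bound.
print(len(middle) == 4 - 1)
```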
<p>One thing that we can’t do with this syntax is directly ask for the data from a single row:</p>
<pre class="sourceCode python"><code class="sourceCode python">data[<span class="dv">1</span>]</code></pre>
<pre class="output"><code>---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-10-c805864c0d75> in <module>()
----> 1 data[1]
/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
2001 # get column
2002 if self.columns.is_unique:
-> 2003 return self._get_item_cache(key)
2004
2005 # duplicate columns
/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
665 return cache[item]
666 except Exception:
--> 667 values = self._data.get(item)
668 res = self._box_item_values(item, values)
669 cache[item] = res
/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item)
1653 def get(self, item):
1654 if self.items.is_unique:
-> 1655 _, block = self._find_block(item)
1656 return block.get(item)
1657 else:
/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in _find_block(self, item)
1933
1934 def _find_block(self, item):
-> 1935 self._check_have(item)
1936 for i, block in enumerate(self.blocks):
1937 if item in block:
/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in _check_have(self, item)
1940 def _check_have(self, item):
1941 if item not in self.items:
-> 1942 raise KeyError('no item named %s' % com.pprint_thing(item))
1943
1944 def reindex_axis(self, new_axis, method=None, axis=0, copy=True):
KeyError: u'no item named 1'</code></pre>
<p>This fails because <code>data[1]</code> is ambiguous: pandas looks for a column named <code>1</code>, not for the row at position 1. If we want a single row, we can either take a slice that returns a single row:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="dv">1</span>:<span class="dv">2</span>]</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
1  2002           85       252        217
</code></pre>
<p>or use the <code>.iloc</code> indexer, which stands for “integer location” since we are looking up the row based on its integer position.</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data.iloc[<span class="dv">1</span>]</code></pre>
<pre class="output"><code>year 2002
temperature 85
rainfall 252
mosquitos 217
Name: 1, dtype: int64
</code></pre>
<p>We can also use this same syntax for getting larger subsets of rows:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data.iloc[<span class="dv">1</span>:<span class="dv">3</span>]</code></pre>
<pre class="output"><code>   year  temperature  rainfall  mosquitos
1  2002           85       252        217
2  2003           86       154        153
</code></pre>
<p>We can also subset the data based on the values in other columns:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="st">'temperature'</span>][data[<span class="st">'year'</span>] > <span class="dv">2005</span>]</code></pre>
<pre class="output"><code>5 75
6 80
7 85
8 74
9 74
Name: temperature, dtype: int64
</code></pre>
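<p>It can help to see this selection as two steps: the comparison produces one True/False value per row, and indexing with those values keeps only the rows marked True. The sketch below assumes Python 3 syntax and builds a small data frame by hand from four rows of the table above (2004 to 2007); the names <code>sample</code>, <code>recent</code>, <code>recent_temps</code>, and <code>same_temps</code> are illustrative.</p>

```python
import pandas

# Four rows copied from the table above (2004-2007).
sample = pandas.DataFrame({
    'year': [2004, 2005, 2006, 2007],
    'temperature': [87, 74, 75, 80],
})

# Step 1: the comparison yields one True/False value per row.
recent = sample['year'] > 2005
print(recent.tolist())

# Step 2: indexing with that boolean Series keeps only the True rows.
recent_temps = sample['temperature'][recent]
print(recent_temps.tolist())

# Equivalent selection in a single call with .loc (rows first, then column).
same_temps = sample.loc[recent, 'temperature']
print(same_temps.tolist())
```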
<p>Data frames also know how to perform common mathematical operations on their values. If we want to find the average value for each variable, we can just ask the data frame for its mean values:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data.mean()</code></pre>
<pre class="output"><code>year 2005.5
temperature 80.0
rainfall 214.6
mosquitos 191.3
dtype: float64
</code></pre>
<p>Data frames have lots of useful methods:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data.<span class="dt">max</span>()</code></pre>
<pre class="output"><code>year 2010
temperature 87
rainfall 292
mosquitos 243
dtype: int64
</code></pre>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="st">'temperature'</span>].<span class="dt">min</span>()</code></pre>
<pre class="output"><code>74
</code></pre>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> data[<span class="st">'mosquitos'</span>][<span class="dv">1</span>:<span class="dv">3</span>].std()</code></pre>
<pre class="output"><code>45.2548339959
</code></pre>
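<p>The last result may look large for two values that differ by 64. By default, pandas computes the sample standard deviation, which divides by n - 1 rather than n. A quick check with the standard library's <code>statistics</code> module, assuming Python 3.4 or later rather than the Python 2.7 used in this lesson:</p>

```python
import statistics

# Mosquito counts for rows 1 and 2 (years 2002 and 2003).
counts = [217, 153]

sample_sd = statistics.stdev(counts)       # divides by n - 1, like pandas .std()
population_sd = statistics.pstdev(counts)  # divides by n

print(sample_sd)
print(population_sd)
```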
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<p>Import the data from <code>A2_mosquito_data.csv</code>, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables.</p>
</div>
</section>
<h2 id="loops">Loops</h2>
<p>Once we have some data we often want to be able to loop over it to perform the same operation repeatedly. A <code>for</code> loop in Python takes the general form</p>
<pre><code>for item in list:
    do_something</code></pre>
<p>So if we want to loop over the temperatures and print out their values in degrees Celsius (instead of Fahrenheit) we can use:</p>
<pre class="sourceCode python"><code class="sourceCode python">temps = data[<span class="st">'temperature'</span>]
<span class="kw">for</span> temp_in_f in temps:
    temp_in_c = (temp_in_f - <span class="dv">32</span>) * <span class="dv">5</span> / <span class="fl">9.0</span>
    <span class="dt">print</span> temp_in_c</code></pre>
<pre class="output"><code>26.6666666667
29.4444444444
30.0
30.5555555556
23.3333333333
23.8888888889
26.6666666667
29.4444444444
23.3333333333
23.3333333333
</code></pre>
<p>That looks good, but why did we use 9.0 instead of 9? The reason is that computers store integers and numbers with decimals as different types: integers and floating point numbers (or floats). Addition, subtraction and multiplication work on both as we’d expect, but division works differently. If we divide one integer by another, we get the quotient without the remainder:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> <span class="st">'10/3 is:'</span>, <span class="dv">10</span> / <span class="dv">3</span></code></pre>
<pre class="output"><code>10/3 is: 3
</code></pre>
<p>If either part of the division is a float, on the other hand, the computer creates a floating-point answer:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span> <span class="st">'10/3.0 is:'</span>, <span class="dv">10</span> / <span class="fl">3.0</span></code></pre>
<pre class="output"><code>10/3.0 is: 3.33333333333
</code></pre>
<p>The computer does this for historical reasons: integer operations were much faster on early machines, and this behavior is actually useful in a lot of situations. However, it’s still confusing, so in Python 3 the <code>/</code> operator always produces a floating-point answer when dividing integers, and a separate <code>//</code> operator performs integer division. We’re still using Python 2.7 in this class, so if we want 5/9 to give us the right answer, we have to write it as 5.0/9, 5/9.0, or some other variation.</p>
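<p>The sketch below gathers these cases in one place. It is written so that each line behaves the same way under Python 2.7 and Python 3: it uses <code>//</code> to request integer division explicitly and a float operand to request floating-point division, and it calls <code>print()</code> with a single argument.</p>

```python
# Floor division: the quotient without the remainder, in Python 2 and 3 alike.
print(10 // 3)

# A float operand forces a floating-point result in both versions.
print(10 / 3.0)

# The Fahrenheit-to-Celsius conversion from the loop above, for 80 degrees.
temp_in_f = 80
temp_in_c = (temp_in_f - 32) * 5 / 9.0
print(temp_in_c)

# The pitfall: in Python 2.7, 5 / 9 evaluates to 0, so writing the formula
# as (temp_in_f - 32) * (5 / 9) would give 0 for every temperature.
# Spelling the floor division explicitly reproduces that on any version.
print((temp_in_f - 32) * (5 // 9))
```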
<h2 id="conditionals">Conditionals</h2>
<p>The other standard thing we need to know how to do in Python is conditionals, or if/then/else statements. In Python the basic syntax is:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">if</span> condition:
    do_something</code></pre>
<p>So if we want to check whether the first temperature in our data is greater than 75 degrees we would use:</p>
<pre class="sourceCode python"><code class="sourceCode python">temp = data[<span class="st">'temperature'</span>][<span class="dv">0</span>]
<span class="kw">if</span> temp > <span class="dv">75</span>:
    <span class="dt">print</span> <span class="st">"The temperature is greater than 75"</span></code></pre>
<pre class="output"><code>The temperature is greater than 75
</code></pre>
<p>We can also use <code>==</code> for equality, <code><=</code> for less than or equal to, <code>>=</code> for greater than or equal to, and <code>!=</code> for not equal to.</p>
<p>Additional conditions can be handled using <code>elif</code> and <code>else</code>:</p>
<pre class="sourceCode python"><code class="sourceCode python">temp = data[<span class="st">'temperature'</span>][<span class="dv">0</span>]
<span class="kw">if</span> temp < <span class="dv">80</span>:
    <span class="dt">print</span> <span class="st">"The temperature is < 80"</span>
<span class="kw">elif</span> temp > <span class="dv">80</span>:
    <span class="dt">print</span> <span class="st">"The temperature is > 80"</span>
<span class="kw">else</span>:
    <span class="dt">print</span> <span class="st">"The temperature is equal to 80"</span></code></pre>
<pre class="output"><code>The temperature is equal to 80
</code></pre>
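<p>Conditionals are most useful inside loops, where the same test is applied to every value. A minimal sketch that prints only the temperatures above 80 degrees: the list is copied from the first data file so the example runs without pandas, <code>threshold</code> and <code>warm_temps</code> are names invented for this illustration, and <code>print()</code> is called with a single argument so the code runs on Python 3 as well as Python 2.7.</p>

```python
# Temperatures from the first data file, in degrees Fahrenheit.
temps = [80, 85, 86, 87, 74, 75, 80, 85, 74, 74]

threshold = 80  # hypothetical cutoff chosen for this example

warm_temps = []
for temp in temps:
    if temp > threshold:
        # Only values strictly above the cutoff reach this branch.
        print(temp)
        warm_temps.append(temp)

print(len(warm_temps))
```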
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<p>Import the data from <code>A2_mosquito_data.csv</code>, determine the mean temperature, and loop over the temperature values. For each value print out whether it is greater than the mean, less than the mean, or equal to the mean.</p>
</div>
</section>
<h2 id="plotting">Plotting</h2>
<p>The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. The main plotting library in Python is <code>matplotlib</code>. To get started, let’s tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window:</p>
<pre class="sourceCode python"><code class="sourceCode python">%matplotlib inline</code></pre>
<p>The <code>%</code> at the start of the line signals that this is a command for the notebook, rather than a statement in Python. Next, we will import the <code>pyplot</code> module from <code>matplotlib</code>, but since <code>pyplot</code> is a fairly long name to type repeatedly let’s give it an alias.</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">from</span> matplotlib <span class="ch">import</span> pyplot <span class="ch">as</span> plt</code></pre>
<p>This import statement shows two new things. First, we can import part of a library by using the <code>from library import submodule</code> syntax. Second, we can use a different name to refer to the imported library by using <code>as newname</code>.</p>
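<p>The import forms can be compared side by side using modules from the standard library, so nothing extra needs to be installed. This is only a sketch: the aliases <code>m</code> and <code>p</code> are arbitrary names picked for the example, and the file name passed to <code>basename()</code> is just a sample string.</p>

```python
# Plain import: names are reached through the library name.
import math
print(math.sqrt(16))

# Import with an alias: the same library under a shorter name.
import math as m
print(m.sqrt(16))

# Import one part of a library and give it an alias.
from os import path as p
print(p.basename('fig/01-intro-python_66_1.png'))
```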
<p>Now, let’s make a simple plot showing how the number of mosquitos varies over time. We’ll use the site you’ve been doing exercises with since it has a longer time-series.</p>
<pre class="sourceCode python"><code class="sourceCode python">data = pandas.read_csv(<span class="st">'A2_mosquito_data.csv'</span>)
plt.plot(data[<span class="st">'year'</span>], data[<span class="st">'mosquitos'</span>])</code></pre>
<pre class="output"><code>[<matplotlib.lines.Line2D at 0x7f0dc3da18d0>]</code></pre>
<div class="figure">
<img src="fig/01-intro-python_66_1.png" alt="Number of mosquitoes through time" />
<p class="caption">Number of mosquitoes through time</p>
</div>
<p>More complicated plots can be created by adding a little additional information. Let’s say we want to look at how the different weather variables vary over time.</p>
<pre class="sourceCode python"><code class="sourceCode python">plt.figure(figsize=(<span class="fl">10.0</span>, <span class="fl">3.0</span>))
plt.subplot(<span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">1</span>)
plt.plot(data[<span class="st">'year'</span>], data[<span class="st">'temperature'</span>], <span class="st">'ro-'</span>)
plt.xlabel(<span class="st">'Year'</span>)
plt.ylabel(<span class="st">'Temperature'</span>)
plt.subplot(<span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">2</span>)
plt.plot(data[<span class="st">'year'</span>], data[<span class="st">'rainfall'</span>], <span class="st">'bs-'</span>)
plt.xlabel(<span class="st">'Year'</span>)
plt.ylabel(<span class="st">'Rain Fall'</span>)</code></pre>
<div class="figure">
<img src="fig/01-intro-python_68_0.png" alt="Temperature and rainfall through time" />
<p class="caption">Temperature and rainfall through time</p>
</div>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-2"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<p>Using the data in <code>A2_mosquito_data.csv</code>, plot the relationship between the number of mosquitos and temperature, and between the number of mosquitos and rainfall.</p>
</div>
</section>
<h3 id="key-points">Key Points</h3>
<ul>
<li>Import a library into a program using <code>import libraryname</code>.</li>
<li>Use the <code>pandas</code> library to work with data tables in Python.</li>
<li>Use <code>variable = value</code> to assign a value to a variable.</li>
<li>Use <code>print something</code> to display the value of <code>something</code>.</li>
<li>Use <code>dataframe['columnname']</code> to select a column of data.</li>
<li>Use <code>dataframe[start_row:stop_row]</code> to select rows from a data frame.</li>
<li>Indices start at 0, not 1.</li>
<li>Use <code>dataframe.mean()</code>, <code>dataframe.max()</code>, and <code>dataframe.min()</code> to calculate simple statistics.</li>
<li>Use <code>for x in list:</code> to loop over values.</li>
<li>Use <code>if condition:</code> to make conditional decisions.</li>
<li>Use the <code>pyplot</code> library from <code>matplotlib</code> for creating simple visualizations.</li>
</ul>
<h2 id="next-steps">Next steps</h2>
<p>With the requisite Python background out of the way, we’re now ready to dig into analyzing our data and, along the way, learn how to write code more efficiently that is more likely to be correct.</p>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>