# FLIP (00): Data Science 
**(Module 00: Python Basics)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session 4 Python Packages and Your Data

This week, we will learn how to use Python packages to manipulate data and files.

Please note that:

1. Some of the packages might not work in Python 3;
1. If the code doesn't work, you can either adapt it to Python 3 yourself, or create a Python 2 environment in Anaconda and run the code there.

## Content



### Part 1 Python packages

1.1 [Standard Library](#standlib)

1.2 [Third Party Packages](#3rdparty)

1.3 [How to Install a Package](#installpack)

1.4 [Importing a module](#importmod) 


### Part 2 Python Simple IO

2.1 [Input](#input)

2.2 [Output](#output)


### Part 3 Datetime Module

3.1 [Time](#time)

3.2 [Date](#date)

3.3 [Timedelta](#timedelta)

3.4 [Formatting and Parsing](#parsing)

### Part 4 Twitter API

4.1 [Search Tweets](#tweeter)

4.2 [Geo-Visualization](#geo)


### Part 5 Numpy Module

5.1 [Importing Numpy](#importnp)

5.2 [Numpy arrays](#nparray)

5.3 [Manipulating arrays](#maninp)

5.4 [Array Operations](#arrayop)

5.5 [np.random](#random)

5.6 [Vectorizing Functions](#vecfunc)



### Part 6 Data Loading

6.1 [TXT](#txt)

6.2 [CSV](#csv)

6.3 [JSON](#json)



---
## <span style="color:#0b486b">1. Python packages</span>

After completing the previous Python sessions, you should know the syntax and semantics of the Python language. Beyond that, you should also learn about Python's libraries and packages to be able to code efficiently. Python’s standard library is very extensive, offering a wide range of facilities as indicated [here](https://docs.python.org/2/library/). The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. See the [Python Standard Library Manual](https://docs.python.org/2/library/) to read more.

In addition to the standard library, there is a growing collection of several thousand components (from individual programs and modules to packages and entire application development frameworks), available from the [Python Package Index](https://pypi.python.org/pypi).

<a id = "standlib"></a>

### <span style="color:#0b486b">1.1 Standard libraries</span>

For a complete list of the modules in the Python standard library and their documentation, see the [Python Manual](https://docs.python.org/2/library/). A few worth mentioning are:

* ``math`` for numeric and math-related functions and data types
* ``urllib`` for fetching data across the web
* ``datetime`` for manipulating dates and times
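* ``pickle`` and ``cPickle`` (``cPickle`` exists in Python 2 only; Python 3's ``pickle`` uses the C implementation automatically) for serializing and deserializing data structures, enabling us to save our variables on the disk and load them from the disk
* ``os`` for operating-system-dependent functions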

<a id = "3rdparty"></a>

### <span style="color:#0b486b">1.2 Third party packages</span>

There are thousands of third party packages, each developed for a special task. Some of the useful libraries for data science are:

* ``numpy`` is probably the most fundamental package for efficient scientific computing in Python
* ``scipy`` is one of the core packages for scientific computations
* ``pandas`` is a library for working with table-like data structures called ``DataFrame`` objects
* ``matplotlib`` is a comprehensive plotting library
* ``BeautifulSoup`` is an HTML and XML parser
* ``scikit-learn`` is the most general machine learning library for Python
* ``nltk`` is a toolkit for natural language processing

<a id = "installpack"></a>
### <span style="color:#0b486b">1.3 How to install a package</span>

The easiest way to install a package is using `conda` (if you are using Anaconda) or `pip` commands. Suppose you want to install the package `NLTK`. Either:
    
    > conda install nltk
    
or    
    
    > pip install nltk
    
    
will install the package.
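
Before installing, it can be useful to check whether a package is already importable in the current environment. A minimal sketch follows; the helper name `is_installed` is hypothetical and is not part of `pip` or `conda`:

```python
import importlib


def is_installed(package_name):
    """Return True if the named package can be imported in this environment."""
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False


# 'math' ships with Python, so it is always importable
print(is_installed("math"))
```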

---
<a id = "importmod"></a>
### <span style="color:#0b486b">1.4 Importing a module</span>

To use a module, first you have to ``import`` it. There are different ways to import a module:

* `import my_module`
* `from my_module import my_function`
* `from my_module import my_function as func`
* `from my_module import submodule`
* `from my_module import submodule as sub`
* `from my_module import *`

**`'import my_module'`** imports the module `'my_module'` and creates a reference to it in the namespace. For example `'import math'` imports the module `'math'` into the namespace. After importing the module this way, you can use the dot operator `(.)` to refer to the objects defined in the module. For example `'math.exp()'` refers to function `'exp()'` in module `'math'`.

In [None]:
import math

x = 2
y1 = math.exp(x)
y2 = math.log(x)

print("e^{} is {} and log({}) is {}".format(x, y1, x, y2))


**`'from my_module import my_function'`** imports only the function `'my_function'` from the module `'my_module'` into the namespace. This way you have access to neither the module itself (since you have not imported it) nor its other objects. You can only access the object you have imported.

You can use a comma to import multiple objects.

In [None]:
from math import exp

x = 2
y = exp(x)  # no need to math.exp()

print "e^{} is {}".format(x, y)
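
As a small illustration of the comma form mentioned above, the sketch below imports two functions in one statement. The call form of `print` is used so that it also runs under Python 3:

```python
from math import exp, log

x = 2
y1 = exp(x)
y2 = log(x)

print("e^{} is {} and log({}) is {}".format(x, y1, x, y2))
```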

**`'from my_module import my_function as func'`** imports the function `'my_function'` from module `'my_module'` but its identifier in the namespace is changed into `'func'`. This syntax is used to import submodules of a module as well. For example later you will see that nowadays it is almost a convention to import matplotlib.pyplot as plt.

In [None]:
# you can change the name of the imported object
from math import exp as myfun

x = 2
y = myfun(x)

print "e^{} is {}".format(x, y)

**`'from my_module import *'`** imports all the public objects defined in `'my_module'` into the namespace. Therefore after this statement you can simply use the plain name of the object to refer to it and there is no need to use the dot operator:

In [None]:
from math import *

x = 2
y1 = exp(x)
y2 = log(x)

print "e^{} is {} and log({}) is {}".format(x, y1, x, y2)

**Exercise1:** 

1. import the library `math` from standard Python libraries
2. define a variable and assign an integer value to it (smaller than 20)
3. use `factorial()` function (an object in `math` library) to calculate the factorial of the variable
4. print its value

In [None]:
# your code here
import math
n = 10
print(math.factorial(n))

**Exercise2:**

1. write a function that takes an integer variable and returns its factorial
2. use it to find the factorial of the variable defined in Exercise1
3. do your answers match?

In [None]:
# your code here
def my_factorial(n):
    if n <= 1:    # base case; also covers 0! = 1
        return 1
    else:
        return n * my_factorial(n-1)
    
print my_factorial(10)

---

## <span style="color:#0b486b">2. Python simple input/output</span>

<a id = "input"></a>

### <span style="color:#0b486b">2.1 Input</span>

`raw_input()` (Python 2; Python 3 has only `input()`, which behaves the same way) asks the user for a string of data (ended with a newline) and simply returns the string.

In [None]:
x = raw_input('What is your name? ')

print "x is {}".format(type(x))
print "Your name is {}".format(x)

**Exercise3:**

1. use `raw_input()` to take a float value between -1 and 1 from the user
2. use the function `acos()` from `math` to find the arc cosine of it
3. print the value of the variable and its arc cosine

In [None]:
# your code here
x = raw_input('Enter a real number between -1 and 1: ')
y = math.acos(float(x))
print 'acos({}) = {}'.format(x,y)

As we know, the domain of the [arc cosine function][acos] is [-1, 1]. So what happens if the value entered by the user is outside the domain (smaller than -1 or greater than 1)?

To avoid raising a ValueError exception, before passing the value to `acos()` function make sure it is in range and if not, display an appropriate message.

[acos]: http://mathworld.wolfram.com/InverseCosine.html

In [None]:
# your code here
x = raw_input('Enter a real number between -1 and 1: ')
x = float(x)
if x>=-1 and x<=1:
    y = math.acos(x)
    print 'acos({}) = {}'.format(x,y)
else:
    print 'Out of range'
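
An alternative to checking the range first is to attempt the conversion and catch the exception. The sketch below uses a hypothetical helper, `safe_acos`, that takes the text the user typed, so it can be tried without interactive input:

```python
import math


def safe_acos(text):
    """Return acos of the number in `text`, or None if it is invalid."""
    try:
        return math.acos(float(text))
    except ValueError:
        # raised by float() for non-numeric text and by acos() when
        # the value lies outside [-1, 1]
        return None


for entry in ["0.5", "2", "abc"]:
    print("{} -> {}".format(entry, safe_acos(entry)))
```

Returning `None` keeps the sketch short; printing a message, as in the solution above, works equally well.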

<a id = "output"></a>

### <span style="color:#0b486b">2.2 Output</span>

The basic way to do output is the print statement. To print multiple things on the same line separated by spaces, use commas between them.

In [None]:
name = "John"
msg = "Hello"

print msg
print msg, name

Successive print statements can write on the same output line if each statement ends with a comma:

In [None]:
for i in range(10):
    print i,
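
In Python 3, `print` is a function and the trailing comma no longer has this effect; the `end` argument is used instead. A minimal Python 3 sketch of the same loop:

```python
# Python 3: replace the default newline with a space
for i in range(10):
    print(i, end=" ")
print()  # finish the line

# an equivalent approach that builds the whole line first
line = " ".join(str(i) for i in range(10))
print(line)
```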

---
## <span style="color:#0b486b">3. datetime module</span>


The datetime module includes functions and classes for date and time parsing, formatting, and arithmetic.

<a id = "time"></a>

### <span style="color:#0b486b">3.1 Time</span>

Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information.

In [None]:
import datetime

t = datetime.time(11, 21, 33)
print t
print 'hour  :', t.hour
print 'minute:', t.minute
print 'second:', t.second
print 'microsecond:', t.microsecond
print 'tzinfo:', t.tzinfo

<a id = "date"></a>

### <span style="color:#0b486b">3.2 Date</span>

Calendar date values are represented with the date class. Instances have attributes for year, month, and day.

In [None]:
import datetime

today = datetime.date.today()
print today
print 'ctime:', today.ctime()
print 'tuple:', today.timetuple()
print 'ordinal:', today.toordinal()
print 'Year:', today.year
print 'Mon :', today.month
print 'Day :', today.day

A way to create new date instances is using the `replace()` method of an existing date. For example, you can change the year, leaving the day and month alone.

In [None]:
import datetime

d1 = datetime.date(2013, 3, 12)
print 'd1:', d1

d2 = d1.replace(year=2015)
print 'd2:', d2

**Exercise4:**

1. Write a piece of code that gives you the day of the week that you were born.
2. How about this year? Do you know what day of the week your birthday falls on?

In [None]:
# your code here
day_of_week = {0 : 'Monday',
              1: 'Tuesday',
              2: 'Wednesday',
              3: 'Thursday',
              4: 'Friday',
              5: 'Saturday',
              6: 'Sunday'}
# you could also use a list to store the days of the week
# and it would work just fine.
# days_of_week = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

print 'Today is', day_of_week[datetime.date.today().weekday()]

my_birthdate = datetime.date(1980,10,10)
print 'I was born on', day_of_week[my_birthdate.weekday()]

t2 = my_birthdate.replace(year=datetime.date.today().year)
print 'and my birthday this year is on a', day_of_week[t2.weekday()]
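
The standard library already provides weekday names, so the dictionary above is optional. A sketch using `calendar.day_name` with the sample birthdate from the solution; the names follow the current locale, which is assumed to be English here:

```python
import calendar
import datetime

birthdate = datetime.date(1980, 10, 10)  # sample date used above

index = birthdate.weekday()          # Monday is 0, Sunday is 6
name = calendar.day_name[index]      # locale-dependent weekday name

print("weekday index: {}".format(index))
print("weekday name : {}".format(name))
```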

<a id = "timedelta"></a>

### <span style="color:#0b486b">3.3 timedelta</span>

Using `replace()` is not the only way to calculate future/past dates. You can use datetime to perform basic arithmetic on date values via the timedelta class. 

In [None]:
today = datetime.datetime.today()
print today

tomorrow = today + datetime.timedelta(days=1)
print tomorrow
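
Subtracting two dates also yields a `timedelta`, which makes it easy to count the days between them. A short sketch with fixed, arbitrarily chosen dates:

```python
import datetime

start = datetime.date(2017, 7, 3)
end = datetime.date(2017, 7, 30)

gap = end - start                     # a datetime.timedelta
print("days between: {}".format(gap.days))

# timedelta accepts several units at once
later = start + datetime.timedelta(weeks=1, days=2)
print("nine days after start: {}".format(later))
```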

**Exercise5:**

Rewrite Exercise4 with timedelta.

In [None]:
# your code here

You can use comparison operators on datetime objects too. It makes sense, right?

In [None]:
tomorrow > today

<a id = "parsing"></a>

### <span style="color:#0b486b">3.4 Formatting and Parsing</span>

The default string representation of a datetime object follows the ISO 8601 layout, with a space instead of `T` between date and time (YYYY-MM-DD HH:MM:SS.mmmmmm); `isoformat()` returns the `T`-separated form. Alternate formats can be generated using `strftime()`. Similarly, if your input data includes timestamp values parsable with `time.strptime()`, then `datetime.strptime()` is a convenient way to convert them to datetime instances.

In [None]:
today = datetime.datetime.today()
print 'ISO     :', today
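
To see the exact string forms without depending on the current time, the sketch below uses a fixed, arbitrarily chosen timestamp and compares `str()` with `isoformat()`:

```python
import datetime

moment = datetime.datetime(2017, 7, 3, 11, 21, 33)

as_str = str(moment)          # date and time separated by a space
as_iso = moment.isoformat()   # date and time separated by 'T'

print(as_str)
print(as_iso)
```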

String from a datetime object:

In [None]:
str_format = "%a %b %d %H:%M:%S %Y"
s = today.strftime(str_format)
print 'strftime:', s

Datetime object from a string:

In [None]:
print s

d = datetime.datetime.strptime(s, str_format)
print d
print 'strptime:', d.strftime(str_format)

In [None]:
s = "07/03/2017"
str_format = "%m/%d/%Y"

d = datetime.datetime.strptime(s, str_format)
print d

**Exercise6:**

You have the string "7/30/2017 - 12:13". How do you convert it into a datetime object?

In [None]:
# your code here
s = '7/30/2017 - 12:13'
str_format = "%m/%d/%Y - %H:%M"
t = datetime.datetime.strptime(s,str_format)
print t

---
## <span style="color:#0b486b">4. Twitter Data and Visualization</span>

To work with Twitter API, we use a package called `TwitterAPI`. You can install it by executing the cell below if you don't have it on your machine.

In [None]:
!pip install -U -I TwitterAPI

To be able to collect data from the Twitter API you need an Access token and secret. For now we have provided you with them, but to obtain yours you can go to https://apps.twitter.com/, click on 'Create New App', fill the form and then click on 'Create your Twitter Application'.

In [None]:
from TwitterAPI import TwitterAPI

CONSUMER_KEY = "YOUR-KEY-HERE"
CONSUMER_SECRET = "YOUR-SECRET-HERE"
OAUTH_TOKEN = "YOUR-TOKEN-HERE"
OAUTH_TOKEN_SECRET = "YOUR-TOKEN-SECRET-HERE"

# Authenticating with your application credentials
api = TwitterAPI(CONSUMER_KEY,
                 CONSUMER_SECRET,
                 OAUTH_TOKEN,
                 OAUTH_TOKEN_SECRET)

Now we have access to the API. For a complete reference on what the API offers, see the [Twitter API documentation](https://dev.twitter.com/overview/api). For example, we can search for tweets that contain a specific keyword or collect tweets from the Twitter stream. Twitter responses are in JSON format, which we can easily parse into a Python dictionary.

<a id = "tweeter"></a>
### <span style="color:#0b486b">4.1 Search Tweets</span>

You can query Twitter with a keyword:

In [None]:
resp = api.request('search/tweets', {'q':'deakin'})

In [None]:
resp

Iterate over the response to print the text of each tweet:

In [None]:
for r in resp:
    print r['text']

**Exercise7:**

1. Select a keyword and crawl some tweets from Twitter containing that keyword and then print them.
2. Crawl 100 tweets containing this keyword and print them. Maybe you want to check Twitter API documentation first.

In [None]:
# code here

There are other parameters that you can set to restrict the response. For example the language of the tweets, or geographical location.

In [None]:
# result_type: popular, recent, mixed
# geocode: lat,long,radius

# geographic coordinates of the desired place
my_lat = 51.5
my_long = 0.12

resp = api.request('search/tweets', {'q':'house', 
                                     'count':'100', 
                                     'lang':'en', 
                                     'result_type':'recent',
                                     'geocode':'{},{},100mi'.format(my_lat, my_long)})
for r in resp:
    print r['text']

**Exercise8:**

Check the API documentation and narrow down your search results for Exercise7 using parameters other than the keyword.

In [None]:
# code here

Apart from the tweet text, you can retrieve other metadata from the Twitter response: for example, the user who sent the tweet, whether the tweet is a reply to another user or a retweet, how many times it has been retweeted, and so on. Since the response is parsed into a dictionary, use the `keys()` method to see the fields that are available:

In [None]:
response = resp.json()

In [None]:
response['statuses'][0]['user']

In [None]:
response['statuses'][0].keys()
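
Because the parsed response is made of nested dictionaries and lists, fields can be read with ordinary indexing. The sketch below uses a small hand-written dictionary, `sample_response`, that only mimics the shape used above (`statuses`, `text`, `user`, `geo`); its values are made up and it is not real Twitter data:

```python
# hypothetical stand-in for resp.json(); values are invented
sample_response = {
    "statuses": [
        {"text": "first tweet", "user": {"name": "alice"},
         "geo": {"coordinates": [51.5, 0.12]}},
        {"text": "second tweet", "user": {"name": "bob"}, "geo": None},
    ]
}

summaries = []
for status in sample_response["statuses"]:
    geo = status.get("geo")                      # None when not geo-tagged
    coords = geo["coordinates"] if geo else None
    summaries.append((status["user"]["name"], status["text"], coords))

for summary in summaries:
    print(summary)
```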

**Exercise9:**

Print the user, place, and geo location of the tweets you have collected.

In [None]:
# Put your code here

<a id = "geo"></a>
### <span style="color:#0b486b">4.2 Geo-Visualization</span>


By now you should be aware of the concept of geo-tagged data: photos that you take with your cell phone (assuming GPS and geo-tagging are activated on your phone), tweets that you send, and so on. In this section we visualize geographical information. We will use JavaScript and the Google Maps API to show the data as points on a map. First we need some geo-tagged data. Let's use the tweets we crawled.

Although we specified a geo-location in our Twitter query, not all the tweets in the response are actually geo-tagged. We remove those that are not and keep the username, tweet message, and latitude/longitude of the rest.

In [None]:
clean_data = []

for r in resp:
    data = []
    try:
        user_name = r['user']['name'].encode('ascii', 'ignore')    # username
        tweet_text = r['text'].encode('ascii', 'ignore')            # tweet message
        data.append("{}: {}".format(user_name, tweet_text))
        data.append(r['geo']['coordinates'][0])    # lat
        data.append(r['geo']['coordinates'][1])    # long
        clean_data.append(data)
    except TypeError:
        # r['geo'] is None for tweets that are not geo-tagged
        print "lat, long not available. "

In [None]:
clean_data

Now our data is ready for visualization.

#### <span style="color:#0b486b">4.2.1 magic functions</span> 


Before we move on, it is worth introducing `magic` functions. IPython magic functions let you control the behaviour of IPython itself and give access to many system features. Any line whose first character is % is treated as a magic function.

In [None]:
%cd

In [None]:
%timeit x = range(1000)

`magic` cell:

In [None]:
%%javascript

alert("Hello World!")

#### <span style="color:#0b486b">4.2.2 Using Google Maps API</span> 


The `IPython.core.display` module offers top-level functions for displaying objects in different formats in the IPython Notebook. We use `HTML` and `Javascript`, which create an HTML and a JavaScript representation of an object respectively. Also note the double percentage sign (%%) at the beginning of some cells. %% and the term that follows it are called `magic functions` and cause the cell to behave differently. For example, %%javascript at the beginning of a cell runs the cell's contents as JavaScript code.

In [None]:
from IPython.core.display import HTML, Javascript

In [None]:
# load the Google Maps API library

def gmap_init():
    js = """
window.gmap_initialize = function() {};
$.getScript('https://maps.googleapis.com/maps/api/js?v=3&sensor=false&callback=gmap_initialize');
    """
    return Javascript(data=js)

gmap_init()

In [None]:
%%html
<style type="text/css">
  .map-canvas { height: 400px; }
</style>

In [None]:
%%html
<div id="markers" class="map-canvas"></div>

In [None]:
def myfun(data, center_lat, center_long):
    
    js = "var data = " + str(data)
    js += """

var map = new google.maps.Map(document.getElementById('markers'),
                              {{zoom: 10,
                               center: new google.maps.LatLng({}, {}),
                               mapTypeId: google.maps.MapTypeId.ROADMAP
                              }});
""".format(center_lat, center_long)
    js += """                              

var infowindow = new google.maps.InfoWindow();

var i;
for (i = 0; i < data.length; i++) {
    var marker = new google.maps.Marker({
        position: new google.maps.LatLng(data[i][1], data[i][2]),
        map: map});

    google.maps.event.addListener(marker, 'click', (function(marker, i) {
        return function() {
            infowindow.setContent(data[i][0]);
            infowindow.open(map, marker);
        }
    })(marker, i));
}
"""
    return Javascript(js)

In [None]:
myfun(clean_data, my_lat, my_long)
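
`str(data)` produces a usable JavaScript literal only for simple lists of numbers and plain strings. A more robust option, sketched below under the assumption that the data holds only JSON-compatible values, is to serialise it with `json.dumps`. The helper name `to_js_assignment` and the sample row are hypothetical:

```python
import json


def to_js_assignment(name, data):
    """Return a JavaScript statement assigning `data` to variable `name`."""
    return "var {} = {};".format(name, json.dumps(data))


# a made-up row containing quotes that need escaping
rows = [['user: said "hello"', 51.5, 0.12]]
statement = to_js_assignment("data", rows)
print(statement)
```

The resulting string could be used in place of `"var data = " + str(data)` when building the script.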

---
## <span style="color:#0b486b">5. Numpy module</span>


Python lists are very flexible for storing any sequence of Python objects. But flexibility usually comes at the price of performance, so Python lists are not ideal for numerical calculations where performance matters. This is where **NumPy** comes in. It adds to Python support for large, multi-dimensional arrays and matrices, along with high-level mathematical functions that operate on these arrays.

Relying on `'BLAS'` and `'LAPACK'`, `'NumPy'` gives Python functionality comparable to `'MATLAB'`. NumPy facilitates advanced mathematical and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences. It has become one of the fundamental packages used for numerical computations.

In this tutorial we will review its basics; to learn more about NumPy, visit the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/index.html).

<a id = "importnp"></a>

### <span style="color:#0b486b">5.1 Importing Numpy</span>

As you have learnt in this session, first we have to import a package to be able to use it. NumPy is imported with:

In [None]:
import numpy

By convention, however, it is imported with an alias:

In [None]:
import numpy as np

<a id = "nparray"></a>

### <span style="color:#0b486b">5.2 Numpy arrays</span>

The core of NumPy is its arrays. You can create an array from a Python list or tuple using the `'array'` function. Arrays work similarly to lists, except that:

* you can easily perform element-wise operations on them, and
* unlike lists, they should be pre-allocated.

The first point is further explained in the [Array operations section](03-prac3.ipynb#Array-operations). The second point means that there is no in-place equivalent of list `append` for arrays: the size of an array is fixed when it is created, and functions such as `np.append` return a new array.
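
A short sketch of the second point: `np.append` does not grow an array in place but builds and returns a new one.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)      # returns a new array; `a` is left unchanged

print(a)
print(b)
```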

#### <span style="color:#0b486b">5.2.1 create an array from a list</span>

In [None]:
x = [1, 7, 3, 4, 0, -5]


In [None]:
y = np.array(x)
type(y)

#### <span style="color:#0b486b">5.2.2 create an array using a range</span>

In [None]:
range(5)

In [None]:
print np.array(range(5))

In [None]:
print np.arange(2, 3, 0.2)   

In [None]:
print np.linspace(2, 3, 5)    # returns numbers spaced evenly on a linear scale, both endpoints are included

In [None]:
print np.logspace(2, 3, 5)    # returns numbers spaced evenly on a log scale, from 10**2 to 10**3

**Note:** If you need help on how to use a function or what it does, you can use IPython help. Just add a question mark (?) after the function name and execute the cell:

In [None]:
np.logspace?

#### <span style="color:#0b486b">5.2.3 create a prefilled array</span>

In [None]:
print np.zeros(5)

In [None]:
print np.ones(5, dtype=int)    # you can specify the data type, default is float

#### <span style="color:#0b486b">5.2.4 `'mgrid'`</span>
Similar to meshgrid in MATLAB:

In [None]:
x, y = np.mgrid[0:5, 0:3]

print x
print y

#### <span style="color:#0b486b">5.2.5 array attributes</span>
NumPy arrays have multiple attributes and methods. The cell below shows a few of them. You can press tab after typing the dot operator `'(.)'` to use IPython auto-complete and see the rest of them.

In [None]:
y = np.array([3, 0, -4, 6, 12, 2])

In [None]:
print "number of dimensions:\t", y.ndim        
print "dimension of the array:", y.shape       
print "numerical data type:\t", y.dtype
print "maximum of the array:\t", y.max()       
print "index of the array max:", y.argmax()    
print "mean of the array:\t", y.mean()      

#### <span style="color:#0b486b">5.2.6 Multi-dimensional arrays</span>


You can define arrays with 2 (or higher) dimensions in numpy:

##### from lists

In [None]:
x = [[1, 2, 10, 20], [3, 4, 30, 40]]
y = np.array(x)
print y
print
print y.ndim, y.shape

##### pre-filled 

In [None]:
x = np.ones((3, 5), dtype='int')

In [None]:
print x
print 
print x.ndim, x.shape

##### `'diag()'`
diagonal matrix

In [None]:
np.diag([1, 2, 3])

<a id = "maninp"></a>

### <span style="color:#0b486b">5.3 Manipulating arrays</span>


#### <span style="color:#0b486b">5.3.1 Indexing</span>


Similar to lists, you can index elements in an array using `'[]'` and indices:

If `'x'` is a 1-dimensional array, `'x[i]'` will index the `'ith'` element of `'x'`:

In [None]:
x = np.array([2, 8, -2, 4, 3])
print x[3]

If `'x'` is a 2-dimensional array:

* `'x[i, j]'` or `'x[i][j]'` will index the element in the `'ith'` row and `'jth'` column
* `'x[i, :]'` will index the `'ith'` row
* `'x[:, j]'` will index the `'jth'` column

In [None]:
x = np.array([[7, 6, 8, 6, 4],
              [4, 7, -2, 0, 9]])
              
print x[1, 3]

In [None]:
print x[1, :]      # or x[1]

In [None]:
print x[:, 3]

Arrays can also be indexed with other arrays:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

idx1 = [1, 3, 4]        # list
idx2 = np.array(idx1)   # array

print x[idx1], x[idx2]
x[idx2] = 0
print x

You can also use index masks. An index mask should be a NumPy array of data type `bool`. An element of the array is selected only if the index mask is True at that element's position.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

In [None]:
mask = np.array([False, True, True, False, False, True, False])

In [None]:
x[mask]

Combining index masks with comparison operators enables you to conditionally select elements of the array.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])
mask = (x >= 2) & (x < 9)    # element-wise AND of two boolean arrays
x[mask]
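
Boolean masks can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses. A sketch with the same array:

```python
import numpy as np

x = np.array([2, 8, -2, 4, 3, 9, 0])

in_range = x[(x >= 2) & (x < 9)]     # both conditions hold
extremes = x[(x < 0) | (x > 8)]      # either condition holds
not_positive = x[~(x > 0)]           # condition does not hold

print(in_range)
print(extremes)
print(not_positive)
```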

#### <span style="color:#0b486b">5.3.2 Slicing</span>


Similar to Python lists, arrays can also be sliced:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

print x[3:]    # slicing
print x[3:7:2]  # slicing with a specified step

In [None]:
x = np.array([[7, 6, 8, 6, 4, 3],
              [4, 7, 0, 5, 9, 5],
              [7, 3, 6, 3, 5, 1]])
              

print x[1, 1:4]
print
print x[:2, 1::2]    # rows 0 and 1, cols 1 up to the end with a step=2

#### <span style="color:#0b486b">5.3.3 Iteration over items</span>


Since most NumPy functions are capable of operating on whole arrays, in many cases iteration over the items of an array can be (and should be) avoided. Otherwise it is much like iterating over the values of a list:

In [None]:
a = np.arange(0, 50, 7)
print a
for item in a:
    print item, 

Of course you could iterate over items using their indices too:

In [None]:
a = np.arange(0, 50, 7)
for i in xrange(a.shape[0]):
    print a[i],

There are also many functions for manipulating arrays. The most used ones are:

#### <span style="color:#0b486b">5.3.4 `copy()`</span>


**Remember** that the assignment operator does not copy an array. Assignment in Python binds a new name to the same object; it does not copy the values.

In [None]:
x = [1, 2, 3]
y = x
print x, y

In [None]:
y[0] = 0       # now we alter an element of y
print x, y     # note that x has changed as well

The same is true for NumPy arrays. That is why, if you need a copy of an array, you should use the `'copy()'` function.

In [None]:
x = np.array([1, 2, 3])
y = x

y[0] = 0       # now we alter an element of y
print x, y     # note that x has changed as well

In [None]:
x = np.array([1, 2, 3])
y = x.copy()  # or np.copy(x)
y[0] = 0

print x, y
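
Slicing behaves like assignment in this respect: a basic slice of an array is a view on the same data, not a copy. A minimal sketch:

```python
import numpy as np

x = np.arange(5)
view = x[1:3]            # basic slicing returns a view
view[0] = 99             # also changes x[1]

independent = x[1:3].copy()
independent[1] = -1      # x is not affected

print(x)
print(np.shares_memory(x, view), np.shares_memory(x, independent))
```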

#### <span style="color:#0b486b">5.3.5 `reshape()`</span>


In [None]:
x1 = np.arange(6)
x2 = x1.reshape((2, 3))    # or np.reshape(x1, (2, 3))

print x1
print
print x2
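
`reshape()` requires the new shape to hold exactly the same number of elements. One dimension may be given as `-1`, in which case NumPy infers it. A brief sketch:

```python
import numpy as np

x = np.arange(12)

a = x.reshape((3, -1))     # -1 is inferred as 4
print(a.shape)

try:
    x.reshape((5, 3))      # 15 elements requested, only 12 available
except ValueError as err:
    print("cannot reshape: {}".format(err))
```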

#### <span style="color:#0b486b">5.3.6 `astype()`</span>


Used for type casting:

In [None]:
x1 = np.arange(5)
x2 = x1.astype(float)

print x1.dtype, x1    # dtype shows the element type; type() would print ndarray for both
print x2.dtype, x2

#### <span style="color:#0b486b">5.3.7 `T`</span> 

Transpose attribute:

In [None]:
x1 = np.random.randint(5, size=(2, 4))
x2 = x1.T

print x1
print
print x2

<a id = "arrayop"></a>

### <span style="color:#0b486b">5.4 Array operations</span>


#### <span style="color:#0b486b">5.4.1 Arithmetic operators</span>


Arrays can be added, subtracted, multiplied and divided using +, -, \* and /. Operations done by these operators are **element-wise**.

In [None]:
x1 = np.array([[2, 3, 5, 7], 
               [2, 4, 6, 8]], dtype=float)
x2 = np.array([[6, 5, 4, 3], 
               [9, 7, 5, 3]], dtype=float)

print x1
print
print x2

In [None]:
print x1 + x2

In [None]:
print x1 - x2

In [None]:
print x1 * x2

In [None]:
print x1 / x2

In [None]:
print 3 + x1

In [None]:
print 3 * x1

In [None]:
print 3 / x1
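
Note that `*` is not matrix multiplication. The sketch below contrasts the element-wise product with the matrix product computed by `np.dot`, using small made-up matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b        # multiplies matching entries
matrix = np.dot(a, b)      # rows of a times columns of b

print(elementwise)
print(matrix)
```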

#### <span style="color:#0b486b">5.4.2 Boolean operators</span>

Much like the arithmetic operators discussed above, boolean (comparison) operators work element-wise on arrays.

In [None]:
x1 = np.array([2, 3, 5, 7])
x2 = np.array([2, 4, 6, 7])
y = x1<x2

print y, y.dtype

Use the methods `'.any()'` and `'.all()'` to return a single boolean value indicating whether any or all values in the array are True respectively. This value in turn can be used as a condition for an `'if'` statement.

In [None]:
print y.all()
print y.any()

NumPy has many other functions that you can read about in the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/). In particular, read about:

* `np.unique`, returns unique elements of an array
* `ndarray.flatten`, flattens a multi-dimensional array (it is a method of the array, e.g. `x.flatten()`)
* `np.mean`, `np.std`, `np.median`
* `np.min`, `np.max`, `np.argmin`, `np.argmax`

<a id = "random"></a>

### <span style="color:#0b486b">5.5 np.random</span>


NumPy has a module called `random` to generate arrays of random numbers. There are different ways to generate a random number:

In [None]:
print np.random.rand()

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1)
print np.random.random([2, 5])

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1); rand() takes the dimensions as separate arguments
print np.random.rand(2, 5)

In [None]:
# 2x5 random array drawn from a uniform distribution on {0, 1, 2, ..., 9}
print np.random.randint(10, size=[2, 5]) 
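
Samples from the standard normal distribution come from `np.random.randn`, while `rand` and `random` draw uniformly from [0, 1). A brief sketch comparing the two:

```python
import numpy as np

uniform = np.random.rand(2, 5)     # uniform on [0, 1)
normal = np.random.randn(2, 5)     # standard normal: mean 0, variance 1

print(uniform)
print(normal)
```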

##### <span style="color:#0b486b">5.5.1 Random seed</span>


Random numbers generated by computers are not really random. They are called pseudo-random. Thus we can set the random generator to generate the same set of random numbers every time. This is useful while testing the code.

In [None]:
for i in range(5):
    print np.random.random(),    

In [None]:
for i in range(5):
    np.random.seed(100)
    print np.random.random(),    
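
Seeding once and then drawing several numbers gives a sequence that can be reproduced later by seeding again with the same value. A minimal sketch (the seed value 100 is arbitrary):

```python
import numpy as np

np.random.seed(100)
first = np.random.random(3)

np.random.seed(100)          # reset the generator to the same state
second = np.random.random(3)

print(first)
print(np.array_equal(first, second))
```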

<a id = "vecfunc"></a>

### <span style="color:#0b486b">5.6 Vectorizing functions</span>


As mentioned earlier in the section on array operations, to get good performance you should avoid looping over the elements of an array and use vectorized algorithms instead. Many NumPy methods and functions already support vectors, so keep this in mind while writing your own code.

But for now, suppose you have written a step function which does not work with arrays, as in the cell below:

In [None]:
def step_func(x):
    """
    scalar implementation of step function
    """
    
    if x>=0:
        return 1
    else:
        return 0

Obviously it fails when dealing with an array, because it expects a scalar as its input. Execute the cell below and see that it raises an error:

In [None]:
# since step_func expects a scalar and receives an array instead, 
# it raises an error

step_func(np.array([2, 7, -4, -9, 0, 4]))

You can use the function `'np.vectorize()'` to obtain a vectorized version of `'step_func'` that can handle vector data:

In [None]:
step_func_vectorized = np.vectorize(step_func)
step_func_vectorized(np.array([2, 7, -4, -9, 0, 4]))

Although `'vectorize()'` can automatically derive a vectorized version of a scalar function, it is essentially a convenience wrapper around a loop, so it is better to write functions that are vector-compatible from the beginning. For example, we could write the step function as shown in the cell below, so that it handles both scalar and vector data.

In [None]:
def step_func2(x):
    """
    vector and scalar implementation of step function
    """
    
    return 1 * (x>=0)

In [None]:
step_func2(np.array([2, 7, -4, -9, 0, 4]))
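
Another vector-friendly way to write the step function is `np.where`, which picks between two values element by element. A sketch (the name `step_func3` is hypothetical):

```python
import numpy as np


def step_func3(x):
    """Step function built on np.where; accepts scalars and arrays."""
    return np.where(np.asarray(x) >= 0, 1, 0)


print(step_func3(np.array([2, 7, -4, -9, 0, 4])))
print(step_func3(-3))
```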

---
## <span style="color:#0b486b">6. File I/O</span>

<a id = "txt"></a>
### <span style="color:#0b486b">6.1 TXT</span>


The TXT file format is the simplest way to store data.

Load a TXT file with `'np.loadtxt()'`:

In [None]:
import numpy as np
x = np.loadtxt("data/txt_data1.txt")
x

Save a TXT file with `'np.savetxt()'`:

In [None]:
y = np.random.randint(10, size=5)
np.savetxt("data/txt_data2.txt", y)
y
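
`np.savetxt()` writes floating-point text by default; the `fmt` argument controls the format. The sketch below does a round trip through a temporary file so that it does not depend on the `data/` folder:

```python
import os
import tempfile

import numpy as np

y = np.array([3, 1, 4, 1, 5])

path = os.path.join(tempfile.mkdtemp(), "example.txt")
np.savetxt(path, y, fmt="%d")          # write integers, one per line
restored = np.loadtxt(path, dtype=int)

print(restored)
```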

<a id = "csv"></a>
### <span style="color:#0b486b">6.2 CSV</span>



The Comma Separated Values format and its variations are among the most used file formats for storing data.

You can use `'np.genfromtxt()'` to read a CSV file:

**NOTE:** The best way to read CSV and XLS files is using **pandas** package that will be introduced later.

In [None]:
x = np.genfromtxt("data/csv_data1.csv", delimiter=",")
x

Use `'np.savetxt()'` to save a 2d-array in a CSV file.

In [None]:
x = np.random.randint(10, size=(6,4))
np.savetxt("data/csv_data2.csv", x, delimiter=',')
x
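
For small files, the standard library `csv` module is an alternative that needs no extra packages. A Python 3 sketch that reads made-up CSV text from an in-memory buffer instead of a file:

```python
import csv
import io

# made-up CSV content standing in for a file on disk
text = "a,b,c\n1,2,3\n4,5,6\n"

reader = csv.reader(io.StringIO(text))
header = next(reader)                          # first row: column names
rows = [[int(v) for v in row] for row in reader]

print(header)
print(rows)
```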

<a id = "json"></a>
### <span style="color:#0b486b">6.3 JSON</span>


JSON is one of the most widely used formats when dealing with web services.

To read JSON, use the `'json'` package: `'load()'` reads from an open file object, while `'loads()'` parses a string that has already been read. For a JSON object, the result is a Python dictionary.

In [None]:
import json
with open("data/json_data1.json", 'rb') as fp:
    fcontent = fp.read()
data = json.loads(fcontent)
data.keys()

In [None]:
data


In [None]:
data['phoneNumbers']

You can also write a Python dictionary (or list) into a JSON file. To do this, use `'dump()'` to write to a file object or `'dumps()'` to get a string.

In [None]:
data = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'}, 
        {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]
data

In [None]:
with open("data/json_data_now.json", 'w') as fp:    # text mode works in both Python 2 and 3
    json.dump(data, fp)
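
`dumps()` and `loads()` are the string counterparts of `dump()` and `load()`. A short round-trip sketch with the same sample records:

```python
import json

records = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
           {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]

text = json.dumps(records, indent=2)   # Python objects -> JSON string
restored = json.loads(text)            # JSON string -> Python objects

print(text)
print(restored == records)
```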