![Practical Data Science](https://swizzlebooks.com/img/ocl/ds_intro.png)

# What is Data Science?

## Data science ~ computer science + mathematics/statistics + visualization

## Outline of a data science project

* Harvesting
* Cleaning
* Analyzing
* Visualizing
* Publishing

## Open Data Licenses

* Open Data Commons Public Domain Dedication and Licence (ODC PDDL) – Public domain
* Creative Commons CCZero – Public domain
* Open Data Commons Attribution License – Attribution for data(bases)
* Open Data Commons Open Database License (ODbL) – Attribution-ShareAlike for data(bases)


# Data Science vs Data Analytics


# Why Python for Data Analytics?


### Actively used Python tools for Data Analytics

* Pandas
* NumPy
* Matplotlib

### Jupyter Notebook

Jupyter Notebook is a web application that allows us to create and share documents that contain:

   * live code (e.g. Python code)
   * visualizations
   * explanatory text (written in markdown syntax)

Jupyter Notebook is great for the following use cases:

   * learn and try out Python
   * data processing / transformation
   * numeric simulation
   * statistical modeling
   * machine learning
   
Get Jupyter Notebook here: http://jupyter.org

Run Jupyter:

> jupyter notebook

### Anaconda

Anaconda is a free and open source distribution of the Python and R programming languages for data science and machine learning applications (large-scale data processing, predictive analytics, scientific computing). It aims to simplify package management and deployment.

### Installing required modules

> pip install numpy pandas matplotlib jupyter



In [3]:
# importing the required modules in Python

import pandas as pd
import numpy as np


## pandas.Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

pandas.Series( data, index, dtype, copy)

* **data** - data takes various forms like ndarray, list, constants
* **index** - Index labels must be hashable and the same length as data; they do not have to be unique. Defaults to 0, 1, ..., n-1 (equivalent to np.arange(n)) if no index is passed.
* **dtype** - dtype is for data type. If None, data type will be inferred
* **copy** - Copy data. Default False
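
As a minimal sketch of how these parameters interact, the example below builds three small Series. The values and the labels `a`, `b`, `c`, `x`, `y` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Made-up values with explicit string labels and an explicit dtype.
readings = pd.Series([20, 21, 19], index=["a", "b", "c"], dtype="float64")

# Without an index, pandas assigns the default labels 0, 1, 2, ...
default_labels = pd.Series([20, 21, 19])

# A scalar ("constant") is repeated once per index label.
constant = pd.Series(5, index=["x", "y"])

print(readings["b"])               # label-based lookup -> 21.0
print(list(default_labels.index))  # [0, 1, 2]
print(constant.tolist())           # [5, 5]
```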

In [4]:
data = [1, 3, 5, np.nan, 6, 8]

s = pd.Series(data)

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
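
The dtype above is float64 even though most of the inputs are integers, because `np.nan` is a floating-point value. The sketch below, which reuses the same values, shows how the common aggregation methods treat the missing entry; it is an illustration rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)          # float64, because NaN is a float
print(s.isna().sum())   # 1 missing value
print(s.count())        # 5 non-missing values
print(s.mean())         # NaN is skipped by default: (1+3+5+6+8)/5 = 4.6

# dropna() returns a new Series without the missing entry.
clean = s.dropna()
print(len(clean))       # 5
```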

## pandas.DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame

   * Columns can be of different types
   * Size is mutable
   * Labeled axes (rows and columns)
   * Can perform arithmetic operations on rows and columns

pandas.DataFrame( data, index, columns, dtype, copy)

* **data** - data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
* **index** - Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
* **columns** - Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
* **dtype** - Data type of each column.
* **copy** - This is used for copying of data, default is False.
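
A short sketch of the `index` and `columns` parameters, using a small made-up array. The labels `r1`..`r3` and `x`, `y` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# Explicit row labels (index) and column labels (columns).
labeled = pd.DataFrame(values, index=["r1", "r2", "r3"], columns=["x", "y"])

# With neither given, both axes fall back to 0, 1, 2, ...
unlabeled = pd.DataFrame(values)

print(labeled.loc["r2", "y"])    # 4
print(list(unlabeled.index))     # [0, 1, 2]
print(list(unlabeled.columns))   # [0, 1]
print(labeled.shape)             # (3, 2)
```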

In [5]:
dates = pd.date_range('20180701', periods=6)

dates

DatetimeIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
               '2018-07-05', '2018-07-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df

| | A | B | C | D |
|---|---|---|---|---|
| 2018-07-01 | 1.40351 | -0.433035 | 0.88202 | -0.879562 |
| 2018-07-02 | 0.388401 | -0.15739 | 0.675737 | 0.712771 |
| 2018-07-03 | -0.022809 | -0.90498 | 3.006503 | -0.010369 |
| 2018-07-04 | -0.092016 | -0.67192 | -0.181144 | 0.732731 |
| 2018-07-05 | 2.095771 | -0.393022 | -0.434802 | 1.38408 |
| 2018-07-06 | -1.46799 | -0.781841 | -0.028576 | 0.599692 |


In [9]:
# creating a DataFrame by passing a dict..

df2 = pd.DataFrame({ 'A' : [1,2,3,4], 
                    'B' : pd.Timestamp('20130102'), 
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 
                    'D' : np.array([3] * 4,dtype='int32'), 
                    'E' : pd.Categorical(["test","train","test","train"]), 
                    'F' : 'foo' })

df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 2 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 3 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 4 | 2013-01-02 | 1.0 | 3 | train | foo |


In [10]:
# each column of the above DataFrame has different data types

df2.dtypes

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [13]:
# viewing data

df.head(2) # first 2 rows from start

| | A | B | C | D |
|---|---|---|---|---|
| 2018-07-01 | 1.40351 | -0.433035 | 0.88202 | -0.879562 |
| 2018-07-02 | 0.388401 | -0.15739 | 0.675737 | 0.712771 |


In [14]:
df.tail(3) # last 3 rows from end

| | A | B | C | D |
|---|---|---|---|---|
| 2018-07-04 | -0.092016 | -0.67192 | -0.181144 | 0.732731 |
| 2018-07-05 | 2.095771 | -0.393022 | -0.434802 | 1.38408 |
| 2018-07-06 | -1.46799 | -0.781841 | -0.028576 | 0.599692 |


In [15]:
# Display the index

df.index

DatetimeIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
               '2018-07-05', '2018-07-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:
# Display columns

df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [17]:
# Underlying NumPy array

df.values

array([[ 1.40350986, -0.43303475,  0.88202044, -0.87956241],
       [ 0.38840065, -0.15739023,  0.67573727,  0.7127706 ],
       [-0.02280897, -0.90498032,  3.00650297, -0.01036861],
       [-0.09201556, -0.67192001, -0.18114442,  0.73273076],
       [ 2.09577106, -0.39302181, -0.43480224,  1.38408042],
       [-1.46798992, -0.78184055, -0.02857624,  0.59969207]])

In [18]:
# A quick statistic summary of our data

df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.384145 | -0.557031 | 0.65329 | 0.423224 |
| std | 1.248321 | 0.278135 | 1.259809 | 0.777081 |
| min | -1.46799 | -0.90498 | -0.434802 | -0.879562 |
| 25% | -0.074714 | -0.75436 | -0.143002 | 0.142147 |
| 50% | 0.182796 | -0.552477 | 0.323581 | 0.656231 |
| 75% | 1.149733 | -0.403025 | 0.83045 | 0.727741 |
| max | 2.095771 | -0.15739 | 3.006503 | 1.38408 |


## Let's get our hands dirty with some real data

Download sample data from here: https://github.com/opencubelabs/notebooks/blob/master/data_science/data/ign.csv

> mkdir data <br>
> mv ~/Downloads/ign.csv ./data/ <br>
> jupyter notebook

We are going to use game review data from IGN, a leading game review platform. The data contains the following columns:

* score_phrase — how IGN described the game in one word. This is linked to the score it received.
* title — the name of the game.
* url — the URL where you can see the full review.
* platform — the platform the game was reviewed on (PC, PS4, etc).
* score — the score for the game, from 1.0 to 10.0.
* genre — the genre of the game.
* editors_choice — N if the game wasn't an editor's choice, Y if it was. This is tied to score.
* release_year — the year the game was released.
* release_month — the month the game was released.
* release_day — the day the game was released.


In [19]:
reviews = pd.read_csv("data/ign.csv")

print(reviews.head(5))

IOError: File data/ign.csv does not exist

In [None]:
print(reviews.tail(5))

In [None]:
reviews.shape # pandas.DataFrame.shape property shows how many rows and columns are in reviews

### Indexing DataFrames with Pandas

**pandas.DataFrame.iloc**

The iloc indexer allows us to retrieve rows and columns by integer position.

In [None]:
reviews.iloc[0:5,:]

Here are some indexing examples:

   * reviews.iloc[:5,:] — the first 5 rows, and all of the columns for those rows.
   * reviews.iloc[:,:] — the entire DataFrame.
   * reviews.iloc[5:,5:] — rows from position 5 onwards, and columns from position 5 onwards.
   * reviews.iloc[:,0] — the first column, and all of the rows for the column.
   * reviews.iloc[9,:] — the 10th row, and all of the columns for that row.
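
The indexing patterns above can be tried without the IGN file. The sketch below uses a tiny made-up frame named `toy` (a hypothetical stand-in for `reviews`, with invented titles and scores).

```python
import pandas as pd

# Hypothetical 4x3 frame standing in for reviews.
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"],
     "score": [9.0, 6.5, 8.0, 7.5],
     "platform": ["PC", "PC", "Xbox One", "PC"]}
)

first_two = toy.iloc[:2, :]      # rows at positions 0 and 1, all columns
first_col = toy.iloc[:, 0]       # every row of the first column (a Series)
third_row = toy.iloc[2, :]       # the row at position 2 (a Series)
tail_block = toy.iloc[2:, 1:]    # rows from position 2 on, columns from position 1 on

print(first_two.shape)           # (2, 3)
print(first_col.tolist())        # ['A', 'B', 'C', 'D']
print(third_row["platform"])     # Xbox One
print(tail_block.shape)          # (2, 2)
```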

Now let's remove the first column from reviews as it does not contain any useful information.

In [None]:
reviews = reviews.iloc[:,1:]
reviews.head()

### Indexing Using Labels in Pandas

**pandas.DataFrame.loc**

A major advantage of Pandas over NumPy is that each of the columns and rows has a label. Working with column positions is possible, but it can be hard to keep track of which number corresponds to which column.
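
One difference that is easy to miss: `iloc` slices by position and excludes the end, while `loc` slices by label and includes the end label. Rows also keep their original labels after slicing. A minimal sketch on made-up scores (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"score": [9.0, 6.5, 8.0, 7.5, 5.0, 8.5]})

by_position = toy.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]       # labels 0, 1, 2, 3 (end included)

# After slicing, rows keep their original labels.
subset = toy.iloc[2:5]

print(len(by_position), len(by_label))  # 3 4
print(list(subset.index))               # [2, 3, 4]
print(subset.loc[3, "score"])           # label 3 -> 7.5
print(subset.iloc[0, 0])                # position 0 of the subset -> 8.0
```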

In [None]:
# displaying the first 6 reviews using loc (labels 0 through 5; loc slices include the end label)

reviews.loc[0:5,:]

In [None]:
reviews.index

In [None]:
# Get the rows at positions 10 through 19 of reviews (iloc excludes the end position 20), and assign the result to some_reviews. Also, display the first 5 rows of some_reviews.

some_reviews = reviews.iloc[10:20,]
some_reviews.head()

In [None]:
some_reviews.loc[10:21,:]

In [None]:
reviews.loc[:5,"score"]

In [None]:
reviews.loc[:5,["score", "release_year"]] # we can specify more than one label in a list

In [None]:
# each column and row in a Pandas DataFrame is a Series

type(reviews["score"])

### Introduction to Pandas DataFrame Methods

In [None]:
reviews["title"].head()

In [None]:
reviews["score"].mean()

In [None]:
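reviews.mean(numeric_only=True) # mean of each numeric column; recent pandas versions raise an error on text columns without numeric_only=True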

In [None]:
reviews.mean(axis=1, numeric_only=True) # mean of the numeric values in each row

A few handy methods for Series and DataFrames:

   * pandas.DataFrame.corr — finds the correlation between columns in a DataFrame.
   * pandas.DataFrame.count — counts the number of non-null values in each DataFrame column.
   * pandas.DataFrame.max — finds the highest value in each column.
   * pandas.DataFrame.min — finds the lowest value in each column.
   * pandas.DataFrame.median — finds the median of each column.
   * pandas.DataFrame.std — finds the standard deviation of each column.
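
These methods can be tried on a tiny made-up frame before running them on the full data set. In the sketch below, `toy` and its values are hypothetical; one score is deliberately missing to show that `count` ignores null values.

```python
import pandas as pd

toy = pd.DataFrame(
    {"score": [9.0, 6.0, 8.0, None],
     "release_year": [2012, 2013, 2014, 2015]}
)

print(toy.count().tolist())    # [3, 4]: the missing score is not counted
print(toy["score"].max())      # 9.0
print(toy["score"].min())      # 6.0
print(toy["score"].median())   # 8.0
print(toy["score"].std())      # sample standard deviation of the three scores

# corr() returns a square table; pick one pair of columns from it.
score_year_corr = toy.corr().loc["score", "release_year"]
print(score_year_corr)
```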


In [None]:
reviews.corr(numeric_only=True) # correlation between the numeric columns

In [None]:
reviews["score"] / 2  # Series arithmetic: divides every score by 2

In [None]:
# Boolean indexing in Pandas

score_filter = reviews["score"] > 7
score_filter.head(10)

In [None]:
# select rows in reviews where score is greater than 7

filtered_reviews = reviews[score_filter]
filtered_reviews.head()

Let's try multiple conditions

* Set up a filter with two conditions:
    * Check if score is greater than 7.
    * Check if platform equals Xbox One
* Apply the filter to reviews to get only the rows we want.
* Use the head method to print the first 5 rows of filtered_reviews

In [None]:
xbox_one_filter = (reviews["score"] > 7) & (reviews["platform"] == "Xbox One")
filtered_reviews = reviews[xbox_one_filter]
filtered_reviews.head()
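
Conditions can also be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` and `==`. A small sketch on made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"score": [9.0, 6.5, 8.0, 7.5],
     "platform": ["Xbox One", "Xbox One", "PC", "PlayStation 4"]}
)

# Naming each condition first avoids parenthesis mistakes.
high = toy["score"] > 7
xbox = toy["platform"] == "Xbox One"

both = toy[high & xbox]      # high-scoring AND on Xbox One
either = toy[high | xbox]    # high-scoring OR on Xbox One
not_xbox = toy[~xbox]        # everything except Xbox One

print(len(both), len(either), len(not_xbox))   # 1 4 2
```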

### Pandas Plotting

In the code below, we:

   * Call %matplotlib inline to set up plotting inside a Jupyter notebook.
   * Filter reviews to only have data about the Xbox One.
   * Plot the score column.


In [None]:
%matplotlib inline
reviews[reviews["platform"] == "Xbox One"]["score"].plot()

In [None]:
reviews[reviews["platform"] == "Xbox One"]["score"].plot(kind='hist')

In [None]:
reviews[reviews["platform"] == "PlayStation 4"]["score"].plot(kind="hist")

In [None]:
filtered_reviews["score"].hist()
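
Outside a notebook, `%matplotlib inline` is not available. The sketch below shows one way to draw the same kind of histogram in a plain Python script, assuming matplotlib is installed. The scores are made up, and the `Agg` backend is chosen here only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Made-up rows standing in for reviews.
toy = pd.DataFrame(
    {"score": [9.0, 6.5, 8.0, 7.5, 8.2],
     "platform": ["Xbox One", "Xbox One", "PC", "Xbox One", "Xbox One"]}
)

# Select the score column for Xbox One rows in a single loc call.
xbox_scores = toy.loc[toy["platform"] == "Xbox One", "score"]

# plot() returns a matplotlib Axes that can be customized further.
ax = xbox_scores.plot(kind="hist", bins=5, title="Xbox One scores")
ax.set_xlabel("score")

# To keep the image, call ax.figure.savefig(...) with a file name of your choice.
print(len(xbox_scores))  # 4
```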

![Thank You](https://swizzlebooks.com/img/ocl/ty.png)