# Using Python for Data Science:

## Introduction:

Data science is the hot new field in the information technology industry. Its popularity has risen steadily over the past few years, driven largely by the explosion in the amount of information being produced over the same period.

In fact, according to Forbes Magazine:

> * Experts are predicting a 4,300 percent increase in annual data production by 2020.
> * On average, companies use only a fraction of the data they collect and store.

The information explosion has been accompanied by a need for astute business analysts who can also program and build models: data scientists. Traditionally, these individuals preferred the R programming language for their work. However, Python is increasingly taking R's place in this space.

We can easily visualize this growing trend. For example, the popular website KDnuggets.com posted a graph showing Python's growing usage, based on searches for keywords used in job postings:

![KDNuggets Python Graph](https://www.ibm.com/developerworks/community/blogs/jfp/resource/BLOGS_UPLOADED_IMAGES/trends0.png)

From this graph, it is clear that anybody looking for a job in this growing field would benefit from gaining experience with Python, both to further their career prospects and to increase their earning potential.

Let's get to it, then.

## Getting Started:

In this article, we will focus on using Python specifically for data science tasks such as data cleansing and building machine learning models. As such, we will skip over the more general programming concepts and reference them only when a specific task requires it.

The main components of any data science project are:

**1) Import the Required Libraries**

**2) Load and Manipulate the Data**

**3) Build Models**

**4) Compare the Results**

These are the components that we will focus on in this workshop. 
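
As a rough preview, the four steps can be sketched end to end with nothing but the standard library. Everything specific here is an assumption made for illustration: the five rows are invented (they are not the Titanic data), and the two "models" (`predict_all_zero` and `predict_by_sex`) are hand-written rules rather than trained machine learning models.

```python
# 1) Import the required libraries (standard library only in this sketch)
import csv
import io

# 2) Load and manipulate the data (invented rows, for illustration only)
raw = "sex,survived\nfemale,1\nfemale,1\nmale,0\nmale,1\nmale,0\n"
rows = [
    {"sex": r["sex"], "survived": int(r["survived"])}
    for r in csv.DictReader(io.StringIO(raw))
]


# 3) Build models: two deliberately trivial, hand-written rules
def predict_all_zero(row):
    return 0


def predict_by_sex(row):
    return 1 if row["sex"] == "female" else 0


# 4) Compare the results using accuracy (share of correct predictions)
def accuracy(model, data):
    hits = sum(model(r) == r["survived"] for r in data)
    return hits / len(data)


scores = {
    "all_zero": accuracy(predict_all_zero, rows),
    "by_sex": accuracy(predict_by_sex, rows),
}
print(scores)
```

A real project would fit its models on a training set and score them on separate, held-out data. The point of the sketch is only the shape of the workflow.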

### Import the Required Libraries:

One of the things that makes Python so powerful is that it is a free, open-source language. As a result, many talented people have created prepackaged bundles of code that we can use in our projects so that we don't have to start from scratch. This code is usually distributed in the form of packages or libraries, which can be installed on our computer and then imported into a program in a few simple steps.

In [1]:
# Import the required libraries
import os
import csv
import pandas as pd

We have just imported some libraries into Python that we can use to read our data and set up our workspace.

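* `os` is a standard-library module that provides a portable, machine-independent way to work with the operating system, for example to inspect or change the working directory
* `csv` is a standard-library module for reading and writing files in `.csv` format
* `pandas` is a popular library used for reading and manipulating data stored in the form of "data frames"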
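
To make "machine-independent" a little more concrete, here is a minimal sketch. It assumes a hypothetical `data/train.csv` location, and the helper name `is_installed` is made up for this example. `os.path.join` builds a path with the right separator for the current operating system, and `importlib.util.find_spec` is one way to check whether a third-party package such as `pandas` is installed before trying to import it.

```python
import importlib.util
import os


def is_installed(package_name):
    """Return True if a top-level package can be located (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


# Build a path that works on Windows, macOS, and Linux alike.
# "data" and "train.csv" are placeholder names, not files from this article.
data_file = os.path.join("data", "train.csv")
print(data_file)

# pandas is a third-party package, so it may or may not be present
print("pandas installed:", is_installed("pandas"))
```

If the check reports that `pandas` is missing, it can usually be installed with a package manager such as `pip` or `conda`.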

Before we go further, let's set up our workspace by defining the working directory.

In [2]:
# Find the working directory
print(os.getcwd())

/Users/zansadiq/Documents/Code/github/Thinkful


In [3]:
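# Define a new directory (placeholder: replace with a real path on your machine)
path = 'path/to/files'

# Change the working directory
# (uncomment once `path` points to an existing directory)
# os.chdir(path)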
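
Changing the working directory affects every relative path used afterwards, so it can be handy to switch temporarily and then switch back. The following is a minimal sketch of that idea, not part of the original notebook. The `working_directory` helper is a hypothetical name, and a throwaway temporary directory stands in for a real project folder.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then restore the previous directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the code inside the `with` block raises an error
        os.chdir(previous)


start_dir = os.getcwd()

# A temporary directory stands in for a real project folder
with tempfile.TemporaryDirectory() as scratch_dir:
    with working_directory(scratch_dir):
        inside_dir = os.getcwd()

end_dir = os.getcwd()
print(start_dir == end_dir)
```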

### Load and Manipulate the Data:

It goes without saying that in order to carry out a data science task, we must start with some data. This data can come in a variety of formats, such as `.csv`, `.xlsx`, or `.json`.
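
The format matters because it affects what we get back when we read the data. As a small illustration, the sketch below parses the same two invented records from CSV text and from JSON text, using only the standard library and in-memory strings instead of real files.

```python
import csv
import io
import json

# Two made-up records, written once as CSV and once as JSON
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# io.StringIO lets the csv module read from a string as if it were a file
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# CSV values always arrive as strings; JSON keeps numbers as numbers
print(type(csv_rows[0]["age"]).__name__)
print(type(json_rows[0]["age"]).__name__)
```

In practice, `pandas` provides readers such as `pd.read_csv`, `pd.read_json`, and `pd.read_excel` that handle much of this type conversion for us.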

Data can exist locally on our machine, or it may live somewhere on the internet and need to be downloaded before it can be read. For today's tutorial, we will use the famous "Titanic" dataset from the Kaggle website. The files can be downloaded [*here*](https://www.kaggle.com/c/titanic/data) and they are already split into a training set and a testing set for us, which is convenient.

In [None]:
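# Load the data from a local file using the csv module
# (replace 'file.csv' with the path to your own file)
with open('file.csv', newline='') as csv_data_file:
    csv_reader = csv.reader(csv_data_file)
    # Each row comes back as a list of strings
    for row in csv_reader:
        print(row)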
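
The cell above returns each row as a plain list, so values have to be looked up by position. As an alternative sketch, `csv.DictReader` uses the header line to key each value by column name. The column names below follow the header of the Kaggle Titanic training file, but the three rows are invented for illustration. They are written to a temporary file so that the example runs without downloading anything.

```python
import csv
import os
import tempfile

# Invented sample rows; only the column names mirror the Titanic file
sample = (
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "9001,0,3,male,30\n"
    "9002,1,1,female,41\n"
    "9003,1,2,female,\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        f.write(sample)

    # DictReader maps each value to its column name from the header row
    with open(sample_path, newline="") as f:
        rows = list(csv.DictReader(f))

# Values are strings, so convert before doing arithmetic
survivors = sum(int(row["Survived"]) for row in rows)

# Missing values show up as empty strings
missing_age = [row["PassengerId"] for row in rows if row["Age"] == ""]

print(survivors, missing_age)
```

Spotting empty strings like this is often the first step of data cleansing, since missing values usually need to be handled before a model can be built.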
