### Key components:
* pandas - load dataframe, slicing  
* feature conversion (categorical / texts --> numerical)  
* image to data (colored / grey scale)

#### Part 0: basic array manipulation in Python  
* array referencing
* reshape an array

In [1]:
# understand index referencing on a Python list
A = [1, 2, 3, 4]
A[0:1] # a slice produces a list: [1]
A[0] # a single index produces the value: 1

1

In [2]:
# explore reshape
import numpy
a = numpy.array([ [1,2,3], [4,5,6] ])
print('a:', a)

a: [[1 2 3]
 [4 5 6]]


In [3]:
numpy.reshape(a,6)

array([1, 2, 3, 4, 5, 6])

In [4]:
numpy.reshape(a,(1,6))

array([[1, 2, 3, 4, 5, 6]])

In [5]:
a.reshape(1,6)

array([[1, 2, 3, 4, 5, 6]])
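
The reshape calls above all spell out the target shape in full. As a minimal sketch (the variable names `flat`, `column`, `pairs` and `mismatch_raised` are illustrative only), passing `-1` for one dimension lets NumPy infer it from the total number of elements, and a target shape whose size does not match raises `ValueError`:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# -1 asks NumPy to infer that dimension from the total element count (6 here).
flat = a.reshape(-1)       # shape (6,)
column = a.reshape(-1, 1)  # shape (6, 1)
pairs = a.reshape(3, -1)   # shape (3, 2), elements kept in row-major order

# A target shape with a different total size is rejected.
try:
    a.reshape(4, 2)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(flat.shape, column.shape, pairs.shape, mismatch_raised)
```

Part 3 relies on the same `-1` shorthand to flatten an image into rows of pixels.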

#### Part 1: explore the basics of pandas
* Load dataframe from csv files
* Initial processing: Boolean evaluation
* Data slicing

In [6]:
# pandas dataframe
import pandas as pd

In [7]:
flc = '/Users/pinqingkan/Desktop/Codes/Course_edX_PythonDataScience/02_Features_DataWrangling/Datasets/'
fname = flc + 'direct_marketing.csv'
df = pd.read_csv(fname)
df.head()

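|  | recency | history_segment | history | mens | womens | zip_code | newbie | channel | segment | visit | conversion | spend | DM_category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10 | 2) $100 - $200 | 142.44 | 1 | 0 | Surburban | 0 | Phone | Womens E-Mail | 0 | 0 | 0.0 | 4 |
| 1 | 6 | 3) $200 - $350 | 329.08 | 1 | 1 | Rural | 1 | Web | No E-Mail | 0 | 0 | 0.0 | 11 |
| 2 | 7 | 2) $100 - $200 | 180.65 | 0 | 1 | Surburban | 1 | Web | Womens E-Mail | 0 | 0 | 0.0 | 1 |
| 3 | 9 | 5) $500 - $750 | 675.83 | 1 | 0 | Rural | 1 | Web | Mens E-Mail | 0 | 0 | 0.0 | 2 |
| 4 | 2 | 1) $0 - $100 | 45.34 | 1 | 0 | Urban | 0 | Web | Womens E-Mail | 0 | 0 | 0.0 | 4 |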


In [8]:
df.mens.isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: mens, dtype: bool
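
The Boolean Series returned by `isnull()` is usually aggregated or used as a row mask. A small sketch on a made-up frame (`demo` and its values are hypothetical, not taken from `direct_marketing.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
demo = pd.DataFrame({
    "mens": [1, 0, np.nan, 1],
    "spend": [0.0, 12.5, 0.0, np.nan],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = demo.isnull().sum()

# Keep only the rows in which every column is present.
complete_rows = demo[demo.notnull().all(axis=1)]

# Any Boolean Series can filter rows; comparisons with NaN evaluate to False.
spenders = demo[demo["spend"] > 0]

print(missing_per_column)
print(len(complete_rows), len(spenders))
```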

In [9]:
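# data slicing: .loc is label-based, so rows 0 through 9 are both included
# .copy() makes B independent of df, so later assignments to B are unambiguous
B = df.loc[0:9, ['zip_code']].copy()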
B.head()

|  | zip_code |
| --- | --- |
| 0 | Surburban |
| 1 | Rural |
| 2 | Surburban |
| 3 | Rural |
| 4 | Urban |
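
Two details of the slice above are easy to miss: `.loc` includes the end label, and a list of column names returns a DataFrame rather than a Series. A minimal sketch on a hypothetical five-row frame (`demo` is illustrative; its values mirror the first rows shown above):

```python
import pandas as pd

demo = pd.DataFrame(
    {"zip_code": ["Surburban", "Rural", "Surburban", "Rural", "Urban"]}
)

by_label = demo.loc[0:2, ["zip_code"]]  # labels 0, 1, 2: end label included
by_position = demo.iloc[0:2, [0]]       # positions 0, 1: end position excluded
as_series = demo.loc[0:2, "zip_code"]   # a bare column name returns a Series

print(len(by_label), len(by_position), type(as_series).__name__)
```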


#### Part 2: Feature conversion using pandas
* Convert nominal features to numerical (2 methods)
* Convert texts to numerical features

In [10]:
# convert to numbers: one 0/1 indicator column per category value
B1 = pd.get_dummies(B, columns = ['zip_code'])
B1.head()

|  | zip_code_Rural | zip_code_Surburban | zip_code_Urban |
| --- | --- | --- | --- |
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |


In [11]:
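# quick and dirty conversion: one integer code per category
# (the codes imply an ordering that nominal values do not really have)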
B.zip_code = B.zip_code.astype("category").cat.codes
B.head()

|  | zip_code |
| --- | --- |
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 2 |
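
Once the column is overwritten with integer codes, the original labels are gone unless the mapping is kept. A minimal sketch, assuming a standalone Series with the same first five values (`zip_code`, `as_category` and `code_to_label` are illustrative names):

```python
import pandas as pd

zip_code = pd.Series(["Surburban", "Rural", "Surburban", "Rural", "Urban"])

# Keep the categorical version around instead of overwriting the column.
as_category = zip_code.astype("category")
codes = as_category.cat.codes

# String categories are sorted, so code i corresponds to categories[i].
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

With the mapping stored, codes can be translated back to labels later, for example when reporting results.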


In [12]:
# text transformation: bag-of-words counts
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
# count the words in each text
corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and harry ran faster and faster."
]
corpus

['Authman ran faster than Harry because he is an athlete.',
 'Authman and harry ran faster and faster.']

In [14]:
Y = CountVectorizer() # many parameters can be tuned (see the repr below)
Y

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [15]:
X = Y.fit_transform(corpus)
X # sparse matrix of token counts, one row per text
Y.get_feature_names_out() # show the words (get_feature_names() before scikit-learn 1.0)
X.toarray()

array([[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 2, 0, 1, 0, 2, 1, 0, 0, 1, 0]], dtype=int64)
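
To see which column belongs to which word, the counts can be rebuilt by hand. The sketch below is a pure-Python approximation and not scikit-learn's implementation; it assumes only the defaults shown in the repr above (`lowercase=True` and the `token_pattern` that keeps tokens of two or more word characters). `tokenize`, `vocabulary` and `counts` are illustrative names.

```python
import re
from collections import Counter

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and harry ran faster and faster.",
]

# Same pattern as the default token_pattern: words of at least two characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then extract the tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# One column per distinct token, in alphabetical order.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per text: how often each vocabulary word occurs in it.
counts = []
for doc in corpus:
    token_counts = Counter(tokenize(doc))
    counts.append([token_counts[word] for word in vocabulary])

for word, column in zip(vocabulary, zip(*counts)):
    print(f"{word:8s} {column}")
```

Under these assumptions the rows of `counts` match the array printed above, with the second column standing for "and" and the sixth for "faster".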

#### Part 3: Data from images
* load images into a matrix
* colored / grey scale

In [16]:
# image processing
# note: scipy.misc.imread was removed in SciPy 1.2; imageio.imread is the suggested replacement
from scipy import misc

In [17]:
img = misc.imread('/Users/pinqingkan/OneDrive/2017 Spring/Nunan/Fcor_Ou_is52.tif')

In [18]:
X = (img / 255.0).reshape(-1,3) # scale to [0, 1], one row per pixel
X # 2D array of shape (n_pixels, 3): colored, columns are R, G, B

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       ..., 
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [19]:
# 1D array, grey
red = X[:,0]
green = X[:,1]
blue = X[:,2]
Z = red*0.299 + green*0.587 + blue*0.114
Z

array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

In [20]:
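# 1D array, all channels flattened (length = height * width * 3)
# note: this interleaves R, G, B values; it is not a greyscale conversion
Y = (img / 255.0).reshape(-1)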
Y

array([ 1.,  1.,  1., ...,  1.,  1.,  1.])
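
Because the image above is almost entirely white, the printed arrays look identical and hide the difference between a weighted greyscale conversion and a plain flatten. A minimal sketch on a hypothetical 2x2 RGB image (`img_demo` is made up for illustration; the weights are the ones used for `Z` above):

```python
import numpy as np

# Hypothetical 2x2 image: red, green, blue and white pixels.
img_demo = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

pixels = (img_demo / 255.0).reshape(-1, 3)   # shape (4, 3): one RGB row per pixel
weights = np.array([0.299, 0.587, 0.114])    # same weights as in the Z computation
grey = pixels @ weights                      # shape (4,): one value per pixel

flat = (img_demo / 255.0).reshape(-1)        # shape (12,): three values per pixel

# The grey values can be folded back into the image's height and width.
grey_image = grey.reshape(img_demo.shape[:2])

print(pixels.shape, grey.shape, flat.shape, grey_image.shape)
print(grey)
```

The weighted sum yields one intensity per pixel, while the flattened array still carries three channel values per pixel.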

In [21]:
# explore reshape
img.shape # (461, 588, 3)
X = (img / 255.0).reshape(461*588,3)
X

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       ..., 
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])