# Data Manipulation

__Data manipulation__ means altering data to make it easier to read or use. This section covers two aspects of it: sorting and grouping attributes, and encoding categorical variables.

In [7]:
# loads the pandas library 
import pandas as pd
import re

# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("clean_baltimore_data.csv")
df.head(n=3)

Unnamed: 0.1,Unnamed: 0,Form,State,Security_Grade,Area_Number,Terrain_Description,Favorable_Influences,Detrimental_Influences,INHABITANTS_Type,INHABITANTS_Annual_Income,...,Ten_Fifteen_Desirability,Remarks,Date,City_clean,Suburb,max_building_age,Year,Day,Month,max_annual_income
0,0,NS FORM-8 6-1-37,Maryland,A,2,Rolling,Fairly new suburban area of homogeneous charac...,No,Substantial Middle Class,"$3000 - 5,000",...,Upward,A recent development with much room for expans...,"May 4,1937",Baltimore,,10.0,1937.0,4.0,May,5000.0
1,1,NS FORM-8 6-1-37,Maryland,A,1,Undulating,Very nicely planned residential area of medium...,No,"Executives, Professional Men",over $5000,...,Upward,Mostly fee properties. A few homes valued at $...,"May 4,1937",Baltimore,,12.0,1937.0,4.0,May,5000.0
2,2,NS FORM-8 6-1-37,Maryland,A,3,Rolling,Good residential area. Well planned.,Distance to City,"Executives, Professional Men",3500 - 7000,...,Upward,Principally fee property. This section lies in...,"May 4,1937",Baltimore,,20.0,1937.0,4.0,May,7000.0


### Sorting and Grouping 

The values of `Area_Number` are out of order. We want the rows grouped by `Security_Grade`, with `Area_Number` in increasing order within each grade.

In [12]:
# removes spaces and any other non-word characters from Security_Grade
df['Security_Grade'] = df['Security_Grade'].str.replace(r'\W', '', regex=True)
# converts 'Area_Number' from type object to a numeric type
df['Area_Number'] = pd.to_numeric(df['Area_Number'])
# .loc replaces the .ix indexer, which was removed in pandas 1.0
df.loc[0:10, ['Security_Grade', 'Area_Number']]

| | Security_Grade | Area_Number |
|---|---|---|
| 0 | A | 1 |
| 1 | A | 2 |
| 2 | A | 4 |
| 3 | A | 6 |
| 4 | A | 3 |
| 5 | A | 5 |
| 6 | B | 1 |
| 7 | B | 2 |
| 8 | B | 3 |
| 9 | B | 4 |


To do this, we use the `sort_values()` method on the original data frame and then reset the index. The rows are sorted first by `Security_Grade`, which groups each grade together, and then by `Area_Number` in increasing order within each grade.

In [15]:
# sorts by Security_Grade first, then by Area_Number within each grade
df = df.sort_values(by=['Security_Grade', 'Area_Number'])
# resets the index starting from 0
df = df.reset_index(drop=True)
# the sorted order is already stored in df, so no further assignment is needed
df.loc[0:10, ['Security_Grade', 'Area_Number']]

| | Security_Grade | Area_Number |
|---|---|---|
| 0 | A | 1 |
| 1 | A | 2 |
| 2 | A | 3 |
| 3 | A | 4 |
| 4 | A | 5 |
| 5 | A | 6 |
| 6 | B | 1 |
| 7 | B | 2 |
| 8 | B | 3 |
| 9 | B | 4 |
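The sort above can be checked with `groupby()`. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not taken from the Baltimore data); it assumes the same two column names and simply confirms that each grade's area numbers come out in increasing order.

```python
import pandas as pd

# hypothetical toy frame that reuses the two column names from above
toy = pd.DataFrame({
    "Security_Grade": ["B", "A", "A", "B", "A"],
    "Area_Number": [2, 3, 1, 1, 2],
})

# same sort-and-reset steps as in the cell above
toy_sorted = toy.sort_values(by=["Security_Grade", "Area_Number"])
toy_sorted = toy_sorted.reset_index(drop=True)

# for each grade, checks that Area_Number never decreases
in_order = toy_sorted.groupby("Security_Grade")["Area_Number"].apply(
    lambda s: s.is_monotonic_increasing
)

# counts the number of areas in each grade
counts = toy_sorted.groupby("Security_Grade").size()

print(in_order)
print(counts)
```

The same two `groupby()` calls could be run on `df` itself, provided `Area_Number` has already been converted to a numeric type.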


### Encoding Categorical Variables

In [9]:
# shows the first 16 values of the categorical column (.loc replaces the removed .ix indexer)
df.loc[0:15, 'INHABITANTS_Population_Increase']

0               Fast 
1     Moderately fast
2     Moderately fast
3              Slowly
4     Moderately fast
5                 NaN
6                 NaN
7              Slowly
8     Moderately fast
9     Moderately fast
10                NaN
11             Slowly
12             Slowly
13         Moderately
14             Slowly
15              Fast 
Name: INHABITANTS_Population_Increase, dtype: object
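The cell above only displays the column. As a minimal sketch of how such a column might be encoded, the example below works on a small hypothetical sample that mirrors the values shown in the output. The ordering from `Slowly` to `Fast` and the names `sample`, `levels`, `Population_Increase_code`, and `Pop_Increase` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# hypothetical sample mirroring the values printed above (not the full data set)
sample = pd.Series(
    ["Fast ", "Moderately fast", "Slowly", None, "Moderately"],
    name="INHABITANTS_Population_Increase",
)

# strips stray whitespace, such as the trailing space in "Fast "
cleaned = sample.str.strip()

# assumed ordering from slowest to fastest; adjust if the data implies otherwise
levels = ["Slowly", "Moderately", "Moderately fast", "Fast"]
ordered = pd.Categorical(cleaned, categories=levels, ordered=True)

# integer codes follow the position in `levels`; missing values become -1
codes = pd.Series(ordered.codes, index=sample.index,
                  name="Population_Increase_code")

# one-hot (dummy) encoding, an alternative when no ordering should be implied
dummies = pd.get_dummies(cleaned, prefix="Pop_Increase")

print(codes)
print(dummies)
```

Ordinal codes keep the ranking in a single column, while dummy columns avoid implying that the gaps between categories are equal; which one fits depends on the later analysis.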

In [14]:
# saves the sorted data frame to a new csv for the next notebook
df.to_csv(r'manipulated_baltimore_data.csv')

Continue to the data analysis and visualization portion of this module [here](Data%20Visualization.ipynb).