## Data Transformation

### Removing Duplicates


In [2]:
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


- __The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:__

In [4]:
# is duplicate

data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

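- __Let's remove the duplicate rows. 'drop_duplicates' returns a DataFrame containing only the rows for which 'duplicated' is False:__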

In [6]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


### Transforming Data Using a Function or Mapping

In [7]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon','Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


- __Let's say you want to add a column indicating the type of animal that each food comes from. First, write down a mapping from each distinct meat type to the kind of animal:__

In [8]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

- __The 'map' method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the 'str.lower' Series method:__

In [9]:
lowercased_food = data['food'].str.lower()
lowercased_food

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [10]:
data['animal'] = lowercased_food.map(meat_to_animal)

In [13]:
# Now let's check the data

data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
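Because 'map' also accepts a function, the lowercasing and the lookup can be combined in one step. The following is a small sketch on a shortened, illustrative version of the food data rather than the full frame above:

```python
import pandas as pd

# Shortened, illustrative version of the food data.
data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'Bacon', 'nova lox'],
                     'ounces': [4, 6, 8, 6]})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# Lowercase each value, then look it up in the dict.
# Note: a food missing from the dict would raise KeyError here,
# whereas mapping with the dict itself would yield NaN instead.
animal = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(animal)
```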


### Replacing values

- __Filling in missing data with the 'fillna' method is a special case of more general value
replacement. As you’ve already seen, 'map' can be used to modify a subset of values in
an object, but 'replace' provides a simpler and more flexible way to do so.__

In [14]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

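- __Let's use the 'replace' method. It returns a new Series with the replacements applied. You can replace a single value, or pass two lists to replace several values at once:__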

In [22]:
data.replace(-999., 999)

0       1.0
1     999.0
2       2.0
3     999.0
4   -1000.0
5       3.0
dtype: float64

In [23]:
data.replace([1, 2], [11, 22])

0      11.0
1    -999.0
2      22.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
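The replacements can also be given as a dict, which keeps each old value next to its new one. A minimal sketch, assuming (for illustration only) that -999 and -1000 are sentinel values for missing or zero readings:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Assumption for illustration: -999 marks missing data and -1000 means zero.
cleaned = data.replace({-999.: np.nan, -1000.: 0.})
print(cleaned)
```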

### Renaming Axis Indexes

- __Like values in a Series, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure.__

In [24]:
import numpy as np

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four']
                   )
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


- __Like a Series, the axis indexes have a map method:__


In [26]:
transform = lambda x: x.upper()
data.index.map(transform)

Index(['OHIO', 'COLORADO', 'NEW YORK'], dtype='object')

In [28]:
data.columns.map(transform)

Index(['ONE', 'TWO', 'THREE', 'FOUR'], dtype='object')

- __You can assign to 'index', modifying the DataFrame in place:__

In [29]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


- __If you want to create a transformed version of a dataset without modifying the original, a useful method is 'rename':__

In [33]:
data.rename(index=str.upper, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


- __'rename' can also be used with a dict-like object to give new values to a subset of the index or column labels:__


In [34]:
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11
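As a quick check of the claim that 'rename' leaves the original untouched, the sketch below rebuilds the same frame, renames both axes, and compares the labels before and after:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# rename returns a new object; `data` keeps its original labels.
renamed = data.rename(index=str.upper, columns={'three': 'peekaboo'})

print(list(data.index), list(data.columns))
print(list(renamed.index), list(renamed.columns))
```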


### Discretization and Binning

- __Let's say you have data about a group of people in a study, and you want to group them into discrete age buckets:__

In [3]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

- __Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older.__
- __For this we will use the 'cut' function of pandas:__

In [1]:
bins = [18, 25, 35, 60, 100]
bins

[18, 25, 35, 60, 100]

In [4]:
import pandas as pd

cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [20]:
# let's find categories

cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [7]:
# count the values that fall in each bin

pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
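By default the intervals are open on the left and closed on the right. As a small sketch, 'cut' also accepts `right=False` to flip which side is closed and `labels` to name the bins. The group names below are illustrative and not part of the original study:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Hypothetical bin names, one per interval.
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
labeled = pd.cut(ages, bins, labels=group_names)

# right=False makes the intervals closed on the left: [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

print(labeled)
print(left_closed)
```

Note that the age 25 falls in the first bin with the default setting but in the second bin when `right=False`.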

- __Another example, this time with uniformly distributed data:__

In [9]:
import numpy as np

data = np.random.rand(20)
data

array([0.50524027, 0.82536114, 0.87897729, 0.95991592, 0.00462971,
       0.03265415, 0.37242904, 0.82340564, 0.79455816, 0.34849283,
       0.68462261, 0.68924725, 0.78371243, 0.84116093, 0.912038  ,
       0.13426596, 0.50474311, 0.43686343, 0.3587991 , 0.29840108])

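- __If we pass an integer number of bins to 'cut' instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data. The 'precision=2' option limits the decimal precision to two digits:__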

In [11]:
pd.cut(data, 4, precision=2)

[(0.48, 0.72], (0.72, 0.96], (0.72, 0.96], (0.72, 0.96], (0.0037, 0.24], ..., (0.0037, 0.24], (0.48, 0.72], (0.24, 0.48], (0.24, 0.48], (0.24, 0.48]]
Length: 20
Categories (4, interval[float64]): [(0.0037, 0.24] < (0.24, 0.48] < (0.48, 0.72] < (0.72, 0.96]]

- __Another example, with normally distributed data:__

In [12]:
data = np.random.randn(1000)
data

array([-0.28329751,  0.69647383,  0.50932473, -0.19788467,  0.3805382 ,
        0.79857702, -0.39722477, -1.15207044, -0.31161365,  1.32508242,
       -1.43234083, -0.859545  , -0.84025562, -0.34535447, -0.63715876,
       -0.28001133, -2.72469632,  1.09097762,  0.08838981, -0.01332076,
        0.6081183 ,  0.1504571 , -1.27992839,  0.80678876,  0.14207136,
       -0.12238775,  0.10154687, -1.48925138,  1.10591033,  0.34753241,
        0.62966578, -2.02831419,  1.05901848,  0.52347709,  0.57545962,
       -0.47828719, -0.86736387,  0.88996649, -1.23015075,  0.68688127,
       -0.77093433, -0.91926497,  1.23646628,  0.09595641,  2.26254434,
        2.05952892,  0.29047633,  0.42038511,  1.73783325,  0.10718633,
       -0.20088158, -2.04345659, -0.37325266,  0.22527216, -0.50674888,
       -1.164767  ,  0.17728947,  0.43819735, -0.52788876, -0.4102603 ,
       -0.57712589,  1.3416967 ,  0.10667698, -1.68010006, -1.14046142,
       -0.48153535, -1.52800811, -1.98560062,  0.2916103 , -1.85

In [14]:
cats = pd.cut(data, 4)
cats

[(-0.309, 1.473], (-0.309, 1.473], (-0.309, 1.473], (-0.309, 1.473], (-0.309, 1.473], ..., (-0.309, 1.473], (1.473, 3.254], (-2.09, -0.309], (-0.309, 1.473], (-0.309, 1.473]]
Length: 1000
Categories (4, interval[float64]): [(-3.878, -2.09] < (-2.09, -0.309] < (-0.309, 1.473] < (1.473, 3.254]]

In [15]:
pd.value_counts(cats)

(-0.309, 1.473]    563
(-2.09, -0.309]    356
(1.473, 3.254]      63
(-3.878, -2.09]     18
dtype: int64
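The counts above are uneven because 'cut' makes the bins equal in width, not in size. A closely related function that the text does not otherwise use, 'qcut', bins by sample quantiles instead. The sketch below assumes a seeded NumPy generator so that the data is reproducible:

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption made here for reproducibility).
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# Cut into quartiles: each bin receives the same number of points.
quartiles = pd.qcut(data, 4)
counts = pd.Series(quartiles).value_counts()
print(counts)
```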

### Detecting and Filtering Outliers
- __Filtering or transforming outliers is largely a matter of applying array operations.__


In [16]:
data = pd.DataFrame(np.random.randn(1000, 4))
data

Unnamed: 0,0,1,2,3
0,0.739944,0.094338,0.361905,0.475463
1,-0.182451,2.482996,2.190688,-0.131998
2,0.453916,-0.604206,-1.056480,-1.363392
3,-1.454186,1.175676,-1.018282,0.003432
4,-0.588149,1.129856,-0.865058,-1.303113
...,...,...,...,...
995,0.704879,-0.069154,0.585960,1.325228
996,1.303773,-0.449581,-1.290746,1.483015
997,-0.486026,0.973949,0.318537,-1.849429
998,0.162705,-1.735722,0.219019,1.209766


In [18]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.054164,-0.023482,-0.001122,0.010325
std,1.022017,1.034671,0.93068,1.030806
min,-2.864327,-3.307033,-3.331742,-3.133937
25%,-0.713493,-0.710143,-0.603772,-0.66343
50%,-0.049706,-0.051692,-0.006663,0.013526
75%,0.654988,0.654663,0.593711,0.726343
max,3.325734,3.411398,2.578913,2.817692


- __Let's say you want to find the values in one of the columns that exceed 3 in absolute value:__


In [22]:
col = data[2]
col

0      0.361905
1      2.190688
2     -1.056480
3     -1.018282
4     -0.865058
         ...   
995    0.585960
996   -1.290746
997    0.318537
998    0.219019
999   -0.055143
Name: 2, Length: 1000, dtype: float64

In [23]:
col[np.abs(col) > 3]

864   -3.331742
Name: 2, dtype: float64

- __To select all rows having a value exceeding 3 or –3, you can use the any method on a
boolean DataFrame:__

In [24]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
7,3.075076,0.27026,-0.659502,0.388007
42,0.056894,-1.337601,-0.776966,-3.133937
164,-2.239333,3.411398,0.103656,0.933392
320,0.748741,-3.04354,0.423652,-0.255932
821,3.325734,0.514552,-0.094404,0.431691
864,0.022212,0.591466,-3.331742,0.43104
880,0.231857,-3.307033,2.138216,1.513971


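- __Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3. The expression 'np.sign(data)' produces 1 and -1 values depending on whether each value is positive or negative:__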

In [25]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.054565,-0.023542,-0.00079,0.010459
std,1.020762,1.032327,0.92955,1.030405
min,-2.864327,-3.0,-3.0,-3.0
25%,-0.713493,-0.710143,-0.603772,-0.66343
50%,-0.049706,-0.051692,-0.006663,0.013526
75%,0.654988,0.654663,0.593711,0.726343
max,3.0,3.0,2.578913,2.817692


### Permutation and Random Sampling

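- __Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do
using the 'numpy.random.permutation' function. Calling it with the length of the axis you want to permute produces an array of integers indicating the new ordering:__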

In [26]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [30]:
sampler = np.random.permutation(5)
sampler

array([2, 3, 4, 0, 1])

- __That array can then be passed to the 'take' method to reorder the rows:__

In [32]:
df.take(sampler)

Unnamed: 0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
0,0,1,2,3
1,4,5,6,7


- __To select a random subset of rows without replacement, use the 'sample' method:__


In [37]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
4,16,17,18,19


### Computing Indicator/Dummy Variables

- __Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix.__
- __If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s.__
- __pandas has a 'get_dummies' function for doing this:__

In [4]:
import pandas as pd

df = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'b'],
    'data1': range(6)
})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [3]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


- __In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data.__
- __'get_dummies' has a 'prefix' argument for doing this:__

In [5]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [6]:
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


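- __If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.__
- __Let's look at the MovieLens 1M dataset. The separator '::' is longer than one character, so we pass engine='python' to avoid a parser warning:__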

In [30]:
mnames = ['movie_id', 'title', 'genres']

In [33]:
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,
                       names=mnames, engine='python')
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [36]:
all_genre = []

In [39]:
for i in movies.genres:
    all_genre.extend(i.split('|'))
    
pd.unique(all_genre)

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
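One way to turn the pipe-separated genres into indicator columns is the 'str.get_dummies' Series method, which splits each string on a separator and builds one column per distinct value. This is a sketch on a three-row frame copied from the first rows shown above, not on the full file; the 'Genre_' prefix is an illustrative choice:

```python
import pandas as pd

# First three rows of the movies table shown above, entered by hand.
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split on '|' and build one 0/1 column per distinct genre.
dummies = movies['genres'].str.get_dummies(sep='|')

# Attach the indicator columns to the original frame under a prefix.
movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
```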

In [None]:
gen = movies.genres[0]
gen