During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time.

# Handling Missing Data

For numeric data, pandas uses the floating-point value NaN (Not a Number) as a sentinel value to represent missing data.

In [4]:
import pandas as pd
import numpy as np

In [5]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [6]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
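The built-in Python None value is also treated as missing in object arrays. A minimal sketch of this, where the name `string_data_demo` is illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing strings, None, and NaN
string_data_demo = pd.Series(['aardvark', None, np.nan, 'avocado'])

# isnull() flags both None and NaN as missing
mask = string_data_demo.isnull()
print(mask.tolist())  # [False, True, True, False]
```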

Some functions related to missing data handling:
![image.png](attachment:image.png)

## Filtering Out Missing Data

In [7]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [8]:
data[data.notnull()]  # equivalent to data.dropna() above

0    1.0
2    3.5
4    7.0
dtype: float64

Note: with a DataFrame, dropna by default drops any row containing a missing value.

In [11]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [12]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [13]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1:

In [14]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [15]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


With time series data, you may want to keep only rows containing at least a certain number of observations. You can specify this with the thresh argument:

In [16]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,0.230365,,
1,-0.848965,,
2,0.706791,,0.119542
3,1.197569,,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


In [17]:
df.dropna()

Unnamed: 0,0,1,2
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


In [18]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.706791,,0.119542
3,1.197569,,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054
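Because the frame above is random, a small deterministic sketch may make the rule easier to see: thresh is the minimum number of non-missing values a row needs in order to be kept. The frame below is a made-up example, not data from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose rows hold 3, 2, 1, and 0 non-missing values
frame = pd.DataFrame([[1.0, 2.0, 3.0],
                      [1.0, np.nan, 3.0],
                      [np.nan, np.nan, 3.0],
                      [np.nan, np.nan, np.nan]])

# Default dropna keeps only rows with no missing values at all
complete = frame.dropna()

# thresh=2 keeps rows that have at least two non-missing values
kept = frame.dropna(thresh=2)

print(complete.index.tolist())  # [0]
print(kept.index.tolist())      # [0, 1]
```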


## Filling In Missing Data

In [19]:
df

Unnamed: 0,0,1,2
0,0.230365,,
1,-0.848965,,
2,0.706791,,0.119542
3,1.197569,,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


In [20]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.230365,0.0,0.0
1,-0.848965,0.0,0.0
2,0.706791,0.0,0.119542
3,1.197569,0.0,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


In [21]:
df

Unnamed: 0,0,1,2
0,0.230365,,
1,-0.848965,,
2,0.706791,,0.119542
3,1.197569,,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


Calling fillna with a dict, you can use a different fill value for each column:

In [22]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,0.230365,0.5,0.0
1,-0.848965,0.5,0.0
2,0.706791,0.5,0.119542
3,1.197569,0.5,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


fillna returns a new object, but you can modify the existing object in-place:

In [24]:
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.230365,0.0,0.0
1,-0.848965,0.0,0.0
2,0.706791,0.0,0.119542
3,1.197569,0.0,-1.088774
4,1.190446,-0.981952,-1.930927
5,0.427529,-1.386895,-0.555427
6,-0.959329,-0.285006,1.10054


In [25]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.351618,-1.151594,0.653102
1,0.97737,2.262805,-1.251567
2,0.50217,,0.445576
3,0.80543,,1.13042
4,0.415481,,
5,1.13308,,


In [26]:
df.ffill()  # forward fill; the older df.fillna(method='ffill') spelling is deprecated in recent pandas

Unnamed: 0,0,1,2
0,-0.351618,-1.151594,0.653102
1,0.97737,2.262805,-1.251567
2,0.50217,2.262805,0.445576
3,0.80543,2.262805,1.13042
4,0.415481,2.262805,1.13042
5,1.13308,2.262805,1.13042


In [27]:
df.ffill(limit=2)  # forward fill at most 2 consecutive missing values per column

Unnamed: 0,0,1,2
0,-0.351618,-1.151594,0.653102
1,0.97737,2.262805,-1.251567
2,0.50217,2.262805,0.445576
3,0.80543,2.262805,1.13042
4,0.415481,,1.13042
5,1.13308,,1.13042


With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:

In [28]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
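The same idea extends to a DataFrame: passing the Series returned by mean() fills each column with its own mean. This is a minimal sketch on a made-up frame (the names `frame` and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column name, so fillna
# aligns on the column labels and uses a different value per column
filled = frame.fillna(frame.mean())

print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(filled['b'].tolist())  # [6.0, 4.0, 8.0]
```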

Reference on fillna arguments:

![image.png](attachment:image.png)

# Data Transformation
Up to this point the focus has been on rearranging data. Filtering, cleaning, and other transformations are another important class of operations.
## Removing Duplicates

In [29]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [30]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:

In [31]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [32]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [34]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:

In [35]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6


## Transforming Data Using a Function or Mapping
Using map is a convenient way to perform element-wise transformations and other data cleaning–related operations.

In [36]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon', 
                              'pastrami', 'honey ham', 'nova lox'], 
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you want to add a column indicating the type of animal that each food came from. Start with a mapping from each distinct meat type to the kind of animal:  

In [37]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

Note that some values in the food column differ from the dict keys only in capitalization, so we first convert every value to lowercase:

In [39]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

Now we can add the column:

In [41]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


The same result can be achieved by passing a lambda that does all the work:

In [43]:
data['animal2'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data

Unnamed: 0,food,ounces,animal,animal2
0,bacon,4.0,pig,pig
1,pulled pork,3.0,pig,pig
2,bacon,12.0,pig,pig
3,Pastrami,6.0,cow,cow
4,corned beef,7.5,cow,cow
5,Bacon,8.0,pig,pig
6,pastrami,3.0,cow,cow
7,honey ham,5.0,pig,pig
8,nova lox,6.0,salmon,salmon


## Replacing Values
Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let’s consider this Series:

In [47]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values appear to be sentinel values for missing data. To replace them with NA values that pandas understands, we can use replace, which produces a new Series:

In [48]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

In [49]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [50]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [51]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Note: the data.replace method is distinct from data.str.replace, which performs string substitution element-wise.
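A short sketch of that distinction, using a made-up Series of string codes: replace swaps whole values, whereas str.replace substitutes inside each string.

```python
import pandas as pd

# Hypothetical string codes
codes = pd.Series(['a-1', 'b-2', 'a-1'])

# Series.replace matches entire values
whole = codes.replace('a-1', 'missing')

# Series.str.replace substitutes substrings within each value
# (regex=False treats '-' as a literal string)
partial = codes.str.replace('-', '_', regex=False)

print(whole.tolist())    # ['missing', 'b-2', 'missing']
print(partial.tolist())  # ['a_1', 'b_2', 'a_1']
```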

## Renaming Axis Indexes
Axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure.

In [52]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), 
                    index=['Ohio', 'Colorado', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [54]:
data.index.map(lambda x : x.upper()) # returns a new Index; the original index is not changed

Index(['OHIO', 'COLORADO', 'NEW YORK'], dtype='object')

In [55]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [56]:
data.index = data.index.map(lambda x : x.upper())
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

In [58]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [59]:
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


To modify the original data structure instead of returning a new one, pass inplace=True to rename.
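A minimal sketch of the in-place form, on a made-up frame: with inplace=True, rename returns None and the labels change on the object itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with upper-case index labels
frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['OHIO', 'COLORADO'],
                     columns=['one', 'two', 'three'])

# With inplace=True nothing is returned; frame itself is relabeled
result = frame.rename(index={'OHIO': 'INDIANA'}, inplace=True)

print(result)                # None
print(frame.index.tolist())  # ['INDIANA', 'COLORADO']
```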

## Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.  
Suppose you have the ages of a group of people and want to divide them into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, use cut, a function in pandas:

In [60]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [61]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [62]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [63]:
pd.value_counts(cats)  # bin counts for the result of pandas.cut

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

In [64]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
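The effect of the closed side is easiest to see for a value that sits exactly on a bin edge. In this sketch (made-up input, same style of bins as above), the age 25 lands in the first bin when bins are closed on the right and in the second bin when they are closed on the left:

```python
import pandas as pd

edge_age = [25]
bins = [18, 25, 35]

# Default right=True: intervals are (18, 25] and (25, 35]
right_closed = pd.cut(edge_age, bins)

# right=False: intervals are [18, 25) and [25, 35)
left_closed = pd.cut(edge_age, bins, right=False)

# codes give the position of the bin each value fell into
print(right_closed.codes.tolist())  # [0]
print(left_closed.codes.tolist())   # [1]
```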

You can also pass your own bin names by passing a list or array to the labels option:

In [65]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:  
The precision=2 option limits the decimal precision to two digits.

In [66]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.28, 0.51], (0.054, 0.28], (0.054, 0.28], (0.74, 0.96], (0.74, 0.96], ..., (0.51, 0.74], (0.054, 0.28], (0.51, 0.74], (0.51, 0.74], (0.28, 0.51]]
Length: 20
Categories (4, interval[float64]): [(0.054, 0.28] < (0.28, 0.51] < (0.51, 0.74] < (0.74, 0.96]]

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [67]:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats

[(-0.0495, 0.65], (-3.621, -0.686], (0.65, 3.744], (0.65, 3.744], (0.65, 3.744], ..., (0.65, 3.744], (-3.621, -0.686], (-0.686, -0.0495], (0.65, 3.744], (-0.0495, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-3.621, -0.686] < (-0.686, -0.0495] < (-0.0495, 0.65] < (0.65, 3.744]]

In [68]:
pd.value_counts(cats)

(0.65, 3.744]        250
(-0.0495, 0.65]      250
(-0.686, -0.0495]    250
(-3.621, -0.686]     250
dtype: int64

Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [69]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-0.0495, 1.214], (-1.232, -0.0495], (-0.0495, 1.214], (1.214, 3.744], (-0.0495, 1.214], ..., (-0.0495, 1.214], (-3.621, -1.232], (-1.232, -0.0495], (-0.0495, 1.214], (-0.0495, 1.214]]
Length: 1000
Categories (4, interval[float64]): [(-3.621, -1.232] < (-1.232, -0.0495] < (-0.0495, 1.214] < (1.214, 3.744]]

## Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [70]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.009477,0.001099,0.01764,0.022625
std,0.986874,1.006846,1.010323,1.033354
min,-3.023459,-2.984238,-3.972118,-3.905722
25%,-0.694507,-0.684664,-0.685053,-0.639549
50%,0.042741,-0.015331,0.022017,-0.019753
75%,0.682985,0.651511,0.704007,0.713073
max,3.478729,3.35478,3.326093,3.075544


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [72]:
data[2][np.abs(data[2]) > 3]

170    3.070011
309    3.326093
587   -3.163856
859   -3.972118
Name: 2, dtype: float64

In [73]:
data[(np.abs(data) > 3).any(axis=1)]
# any(axis=1) is True for each row with at least one value exceeding 3 in absolute value.

Unnamed: 0,0,1,2,3
30,0.093876,-0.041049,0.714928,-3.110048
93,0.250427,1.639086,0.372569,-3.461848
112,-0.555095,3.35478,0.033115,-2.23346
170,-1.516595,0.450361,3.070011,1.698275
275,-3.023459,1.366491,0.434271,-0.732849
309,0.40842,1.717048,3.326093,0.046889
410,3.478729,-1.23186,-0.39568,0.596956
587,0.012714,0.589577,-3.163856,1.483267
596,1.45115,-0.066785,-1.094446,-3.905722
717,0.865881,1.408799,1.675446,3.075544


Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [76]:
data[np.abs(data) > 3] = np.sign(data) * 3
# The `sign` function returns ``-1 if x < 0, 0 if x==0, 1 if x > 0``.
# nan is returned for nan inputs.
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.009932,0.000744,0.018379,0.024027
std,0.985224,1.005725,1.005207,1.028287
min,-3.0,-2.984238,-3.0,-3.0
25%,-0.694507,-0.684664,-0.685053,-0.639549
50%,0.042741,-0.015331,0.022017,-0.019753
75%,0.682985,0.651511,0.704007,0.713073
max,3.0,3.0,3.0,3.0


In [77]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,-1.0,1.0,-1.0
1,-1.0,-1.0,1.0,1.0
2,1.0,1.0,1.0,-1.0
3,1.0,1.0,-1.0,1.0
4,-1.0,-1.0,1.0,-1.0


## Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [78]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler

array([2, 4, 1, 0, 3])

In [79]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


That array can then be used in iloc-based indexing or the equivalent take function:

In [80]:
df.take(sampler)

Unnamed: 0,0,1,2,3
2,8,9,10,11
4,16,17,18,19
1,4,5,6,7
0,0,1,2,3
3,12,13,14,15


In [85]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3
2,8,9,10,11
4,16,17,18,19
1,4,5,6,7
0,0,1,2,3
3,12,13,14,15


To select a random subset without replacement, you can use the sample method on Series and DataFrame:

In [86]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
0,0,1,2,3
4,16,17,18,19
3,12,13,14,15


To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:

In [88]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace = True)
draws

2   -1
2   -1
4    4
4    4
2   -1
2   -1
1    7
1    7
0    5
4    4
dtype: int64

In [89]:
# Note: requesting more elements than the Series holds while
# replace=False (the default) raises
# ValueError: Cannot take a larger sample than
# population when 'replace=False'
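The behavior described in that note can be checked directly. The sketch below also passes random_state, an optional argument of sample, to make a draw repeatable; the seed 42 is arbitrary.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# The same random_state gives the same draw each time
first = choices.sample(n=3, random_state=42)
second = choices.sample(n=3, random_state=42)

# Asking for more rows than exist without replacement fails
try:
    choices.sample(n=10)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(first.equals(second))  # True
print(error_message)
```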

## Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame:

In [91]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [92]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [93]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
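Recent pandas releases (2.0 and later, as far as I know) return boolean True/False indicator columns by default rather than the 1s and 0s shown above. Passing the dtype argument requests integers explicitly, as in this sketch on a shortened, made-up version of the frame:

```python
import pandas as pd

# Hypothetical, shortened version of the example frame
frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c'],
                      'data1': range(4)})

# dtype=int asks for 0/1 integers regardless of the version default
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)
joined = frame[['data1']].join(dummies)

print(joined.columns.tolist())  # ['data1', 'key_a', 'key_b', 'key_c']
print(joined['key_b'].tolist())  # [1, 1, 0, 0]
```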


A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:

In [94]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [96]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


# String Manipulation
Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

## String Object Methods
In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split. split is often combined with strip to trim whitespace (including line breaks):

In [99]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

In [101]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [102]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':

In [104]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s in keyword is the best way to detect a substring, though index and find can also be used:

In [106]:
val

'a,b, guido'

In [107]:
'guido' in val

True

In [108]:
val.index(',')

1

In [109]:
val.find(':')

-1

Note the difference between find and index is that index raises an exception if the string isn’t found (versus returning –1):

In [111]:
# val.index(':') # ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring:

In [112]:
val.count(',')

2

replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [113]:
val.replace(',', '::')

'a::b:: guido'

In [114]:
val.replace(',', '')

'ab guido'

Some of Python’s built-in string methods:  
![image.png](attachment:image.png)

strip(): removes leading and trailing characters. With no argument it trims whitespace; otherwise it removes any of the characters listed in its argument.  
lstrip(): does the same for leading characters only.  
rstrip(): does the same for trailing characters only.
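One detail worth illustrating is that the argument is treated as a set of characters, not as a prefix or suffix string. A short sketch with made-up strings:

```python
# No argument: whitespace (including the newline) is trimmed
padded = '  guido \n'
trimmed = padded.strip()

# lstrip and rstrip work on one side only
left_only = 'xxguidoxx'.lstrip('x')
right_only = 'xxguidoxx'.rstrip('x')

# The argument is a set of characters: any of 'w', '.', 'm', 'o', 'c'
# is removed from both ends until a different character is reached
stripped_set = 'www.example.com'.strip('w.moc')

print(repr(trimmed))   # 'guido'
print(left_only)       # guidoxx
print(right_only)      # xxguido
print(stripped_set)    # example
```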

## Regular Expressions
A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings. The re module functions fall into three categories: pattern matching, substitution, and splitting.  
Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

In [115]:
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)  # raw string so that \s is not treated as a string escape

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object.  
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.

In [116]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:

In [117]:
regex.findall(text)

[' ', '\t ', ' \t']

To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\\x'.

match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [119]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [120]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [121]:
text[m.start():m.end()]

'dave@google.com'

regex.match returns None, as it only will match if the pattern occurs at the start of the string:

In [123]:
print(regex.match(text))

None


re.search() vs re.match():  
Both functions return the first match found, but they differ in where they look. re.match() only attempts to match at the beginning of the string and returns None if the pattern does not occur there, even when it occurs later in the string. re.search() scans the entire string, including every line of a multi-line string, and returns the first location where the pattern matches.
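The difference can be sketched with a simple digit pattern and a made-up string in which the digits do not come first:

```python
import re

# Hypothetical pattern and text for illustration
digits = re.compile(r'\d+')
message = 'order 66 shipped'

# match only looks at the start of the string, where there is no digit
at_start = digits.match(message)

# search scans the whole string and finds the first occurrence
anywhere = digits.search(message)

# match succeeds once the digits are at the beginning
leading = digits.match('66 shipped')

print(at_start)          # None
print(anywhere.group())  # 66
print(leading.group())   # 66
```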

Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

In [127]:
print(regex.sub('<His Mail>', text))

Dave <His Mail>
Steve <His Mail>
Rob <His Mail>
Ryan <His Mail>



Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment. A match object produced by this modified regex returns a tuple of the pattern components with its groups method:

In [130]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [136]:
m.group(0)

'wesm@bright.net'

In [137]:
m.group(1)

'wesm'

In [138]:
m.group(2)

'bright'

In [139]:
m.group(3)

'net'

findall returns a list of tuples when the pattern has groups:

In [131]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [132]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
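Groups can also be given names with the `(?P<name>...)` syntax of the re module, in which case the match object's groupdict method returns them as a dict. This sketch reuses the email pattern from above; the group names username, domain, and suffix are chosen here for illustration.

```python
import re

# Same email pattern as above, with a name attached to each group
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
                 r'@(?P<domain>[A-Z0-9.-]+)'
                 r'\.(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

m = named_regex.match('wesm@bright.net')

# groupdict maps each group name to the text it matched
parts = m.groupdict()
print(parts)  # {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}
```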



![image.png](attachment:image.png)

## Vectorized String Functions in pandas
Cleaning up a messy dataset often requires a lot of string manipulation. To complicate matters, a column containing strings will sometimes have missing data:

In [141]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [142]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:

In [145]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [146]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [147]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

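There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute. Note that str.match, shown here, returns a boolean Series indicating whether each value matches the pattern, so its result contains no elements to retrieve: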

In [150]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
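Element retrieval is more useful on the result of str.findall, which holds a list of tuples per row. The sketch below, on a shortened made-up version of the Series, pulls out the first match with str.get and then the username with str indexing; str.extract is shown as a related method that returns the groups as DataFrame columns.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical, shortened version of the email Series
emails = pd.Series({'Dave': 'dave@google.com',
                    'Rob': 'rob@gmail.com',
                    'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Each non-missing row holds a list of (username, domain, suffix) tuples
found = emails.str.findall(pattern, flags=re.IGNORECASE)

# str.get(0) takes the first tuple; str[0] then takes its first element
first_match = found.str.get(0)
usernames = first_match.str[0]

# str.extract puts each capture group in its own column
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(usernames.tolist()[:2])       # ['dave', 'rob']
print(parts.loc['Dave'].tolist())   # ['dave', 'google', 'com']
```

Missing values pass through every step as NA instead of raising an error.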

You can similarly slice strings using this syntax:

In [153]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

![image.png](attachment:image.png)

#### *Note - Most of the contents (images, examples, statements, etc.) in my notebooks / notes belong to Wes McKinney, author of the book "Python for Data Analysis". I have collected / integrated them for study purposes and I don't own them.*