# 101 Pandas Exercises for Data Analysis

## Index
#### 21. How to convert a series of date-strings to a timeseries?
#### 22. How to get the day of month, week number, day of year and day of week from a series of date strings?
#### 23. How to convert year-month string to dates corresponding to the 4th day of the month?
#### 24. How to filter words that contain at least 2 vowels from a series?
#### 25. How to filter valid emails from a series?
#### 26. How to get the mean and sum of a series grouped by another series?
#### 27. How to compute the euclidean distance between two series?
#### 28. How to find all the local maxima (or peaks) in a numeric series?
#### 29. How to replace missing spaces in a string with the least frequent character?
#### 30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays) after that having random numbers as values?


## 21. How to convert a series of date-strings to a timeseries?

In [1]:
import pandas as pd
import numpy as np
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
ser


0         01 Jan 2010
1          02-02-2011
2            20120303
3          2013/04/04
4          2014-05-05
5    2015-06-06T12:20
dtype: object

In [2]:
# Solution 1
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

In [3]:
# Solution 2
from dateutil.parser import parse
ser.map(lambda x: parse(x))

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
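
When every string shares one layout, passing an explicit `format` can be safer than relying on inference. The sketch below assumes a small hypothetical series (not the exercise data) with one invalid entry, and uses `errors='coerce'` so that the bad value becomes `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample: two well-formed ISO dates and one invalid entry.
raw = pd.Series(['2010-01-01', '2011-02-02', 'not a date'])

# An explicit format avoids ambiguous inference; errors='coerce' turns
# entries that cannot be parsed into NaT instead of raising an error.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
```

Checking `parsed.isna()` afterwards shows which rows failed to parse.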

## 22. How to get the day of month, week number, day of year and day of week from a series of date strings?

In [4]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
ser

0         01 Jan 2010
1          02-02-2011
2            20120303
3          2013/04/04
4          2014-05-05
5    2015-06-06T12:20
dtype: object

In [5]:
# Solution

from dateutil.parser import parse
ser_ts = ser.map(lambda x: parse(x))

# day of month
print("Date: ", ser_ts.dt.day.tolist())

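# week number (ISO 8601); dt.weekofyear was removed in recent pandas versions
print("Week number: ", ser_ts.dt.isocalendar().week.tolist())

# day of year
print("Day number of year: ", ser_ts.dt.dayofyear.tolist())

# day of week; dt.weekday_name was replaced by dt.day_name()
print("Day of week: ", ser_ts.dt.day_name().tolist())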

Date:  [1, 2, 3, 4, 5, 6]
Week number:  [53, 5, 9, 14, 19, 23]
Day number of year:  [1, 33, 63, 94, 125, 157]
Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
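
The week number 53 for 1 January 2010 can look surprising. It follows from the ISO 8601 calendar, in which that date still belongs to the last week of 2009. A minimal standard-library check, independent of pandas:

```python
from datetime import date

d = date(2010, 1, 1)

# isocalendar() returns (ISO year, ISO week number, ISO weekday),
# where Monday is weekday 1 and Sunday is weekday 7.
iso_year, iso_week, iso_weekday = d.isocalendar()
print(iso_year, iso_week, iso_weekday)   # 2009 53 5

# Day of the year, counting from 1.
print(d.timetuple().tm_yday)             # 1
```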


## 23. How to convert year-month string to dates corresponding to the 4th day of the month?

In [6]:
# Convert ser to dates that fall on the 4th of the respective months.
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])
ser

0    Jan 2010
1    Feb 2011
2    Mar 2012
dtype: object

In [7]:
# Solution 1
from dateutil.parser import parse
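# Parse the date. dateutil fills the missing day from the current date,
# so the day shown in the output depends on when this cell is run.
ser_ts = ser.map(lambda x: parse(x))
ser_ts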


0   2010-01-31
1   2011-02-28
2   2012-03-31
dtype: datetime64[ns]

In [8]:
# Build a date string with the day set to 04
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'
ser_datestr


0    2010-1-04
1    2011-2-04
2    2012-3-04
dtype: object

In [9]:
# Format it.
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]



['2010-01-04', '2011-02-04', '2012-03-04']

In [10]:
# Solution 2
ser.map(lambda x: parse('04 ' + x))

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]
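
A pandas-only variant is also possible. This sketch assumes the strings always follow the abbreviated-month-plus-year layout (`%b %Y`) and that month names are in English: parsing with that format yields the first of each month, and adding three days lands on the 4th.

```python
import pandas as pd

ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# With an explicit '%b %Y' format the missing day defaults to 1.
first_of_month = pd.to_datetime(ser, format='%b %Y')

# Shift from the 1st to the 4th of each month.
fourth = first_of_month + pd.Timedelta(days=3)
print(fourth)
```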

## 24. How to filter words that contain at least 2 vowels from a series?

In [11]:
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

In [18]:
# Solution 1 (explicit loop; prints the vowel count, then the word)
vowels = set('aeiouAEIOU')

for word in ser:
    # Count the characters of the word that are vowels
    count = sum(ch in vowels for ch in word)
    if count >= 2:
        print(count)
        print(word)

    
    

2
Apple
3
Orange
2
Money


In [19]:
# Solution 2
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]

0     Apple
1    Orange
4     Money
dtype: object
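
The same filter can be written with the vectorized string accessor. This is a sketch of an alternative, not a replacement for the solutions above: `Series.str.count` counts regex matches per element, and `re.IGNORECASE` covers upper-case vowels.

```python
import re

import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Count vowels in each word, ignoring case.
vowel_counts = ser.str.count(r'[aeiou]', flags=re.IGNORECASE)

# Keep only the words with at least two vowels.
result = ser[vowel_counts >= 2]
print(result)
```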

## 25. How to filter valid emails from a series?

In [25]:
# Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])

pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'


In [34]:
# Solution 1 (as series of strings)
import re

mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]

1    rameses@egypt.com
2            matt@t.co
3    narendra@modi.com
dtype: object

In [35]:
# Solution 2 (as series of list)
emails.str.findall(pattern, flags=re.IGNORECASE)


0                     []
1    [rameses@egypt.com]
2            [matt@t.co]
3    [narendra@modi.com]
dtype: object

In [36]:
# Solution 3 (as list)
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]

['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
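
One caveat with Solution 1: `re.match` only anchors the pattern at the start of the string, so an address followed by extra text still passes. If the whole string must be an email, `re.fullmatch` is stricter. The sample strings below are hypothetical and chosen only to show the difference.

```python
import re

pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

# Hypothetical inputs: a clean address, an address with trailing text,
# and a string that contains no address at all.
candidates = ['matt@t.co', 'matt@t.co and more', 'buying books at amazom.com']

# re.match succeeds if the pattern matches a prefix of the string.
prefix_ok = [bool(re.match(pattern, s)) for s in candidates]

# re.fullmatch succeeds only if the pattern matches the entire string.
whole_ok = [bool(re.fullmatch(pattern, s)) for s in candidates]

print(prefix_ok)  # [True, True, False]
print(whole_ok)   # [True, False, False]
```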

## 26. How to get the mean and sum of a series grouped by another series?

In [38]:
# Compute the mean and sum of the weights for each fruit.

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
['carrot', 'banana', 'carrot', 'apple', 'carrot', 'apple', 'banana', 'banana', 'banana', 'apple']


In [41]:
# Solution (mean)
weights.groupby(fruit).mean()

apple     6.666667
banana    6.500000
carrot    3.000000
dtype: float64

In [42]:
# Solution (sum)
weights.groupby(fruit).sum()

apple     20.0
banana    26.0
carrot     9.0
dtype: float64
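
Both statistics can be computed in one pass with `agg`. Because the exercise draws `fruit` at random, this sketch hard-codes the sample printed above so that the numbers are reproducible.

```python
import numpy as np
import pandas as pd

# Fixed copy of the randomly drawn sample shown in the output above.
fruit = pd.Series(['carrot', 'banana', 'carrot', 'apple', 'carrot',
                   'apple', 'banana', 'banana', 'banana', 'apple'])
weights = pd.Series(np.linspace(1, 10, 10))

# agg with a list of function names returns one column per statistic.
summary = weights.groupby(fruit).agg(['mean', 'sum'])
print(summary)
```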

## 27. How to compute the euclidean distance between two series?

In [43]:
# Compute the euclidean distance between series (points) p and q, without using a packaged formula.

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

In [46]:
# Solution 
sum((p - q)**2) ** 0.5


18.16590212458495

In [47]:
# Solution 2 (using np.linalg.norm)
np.linalg.norm(p-q)

18.16590212458495
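
As a sanity check, the result can be reproduced by hand: the differences are ±9, ±7, ±5, ±3, ±1, so the squared differences sum to 2 × (81 + 49 + 25 + 9 + 1) = 330, and the distance is √330. The sketch below confirms this with the standard library, assuming Python 3.8 or later for `math.dist`.

```python
import math

p_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q_vals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# Sum of squared element-wise differences.
squared_sum = sum((a - b) ** 2 for a, b in zip(p_vals, q_vals))

# math.dist computes the Euclidean distance between two points.
distance = math.dist(p_vals, q_vals)

print(squared_sum)  # 330
print(distance)
```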

## 28. How to find all the local maxima (or peaks) in a numeric series?

In [48]:
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
ser

0     2
1    10
2     3
3     4
4     9
5    10
6     2
7     7
8     3
dtype: int64

In [51]:
np.diff(ser)

array([ 8, -7,  1,  5,  1, -8,  5, -4], dtype=int64)

In [53]:
np.sign(np.diff(ser))

array([ 1, -1,  1,  1,  1, -1,  1, -1], dtype=int64)

In [55]:
# Solution
# Take the sign of each first difference, then difference again:
# a value of -2 marks a rise (+1) followed by a fall (-1), i.e. a peak.
dd = np.diff(np.sign(np.diff(ser)))
dd

array([-2,  2,  0,  0, -2,  2, -2], dtype=int64)

In [56]:
peak_locs = np.where(dd == -2)[0] + 1
peak_locs

array([1, 5, 7], dtype=int64)
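
The same peaks can be found by comparing each value with its two neighbours. This is an alternative sketch using `shift`. It treats only strict peaks as maxima, so plateaus and the two endpoints are never reported, because comparisons against the `NaN` introduced by `shift` evaluate to `False`.

```python
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# A strict local maximum is larger than both its left and right neighbour.
is_peak = (ser > ser.shift(1)) & (ser > ser.shift(-1))

peaks = ser[is_peak]
print(peaks.index.tolist())  # [1, 5, 7]
```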

## 29. How to replace missing spaces in a string with the least frequent character?

In [57]:
my_str = 'dbc deb abed gade'

In [59]:
# Solution
ser = pd.Series(list(my_str))
print(ser)

0     d
1     b
2     c
3      
4     d
5     e
6     b
7      
8     a
9     b
10    e
11    d
12     
13    g
14    a
15    d
16    e
dtype: object


In [60]:
freq = ser.value_counts()
print(freq)


d    4
     3
e    3
b    3
a    2
g    1
c    1
dtype: int64


In [64]:
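# value_counts() sorts in descending order, so the last label is a least
# frequent character. 'c' and 'g' tie here; which of them comes last may
# vary between pandas versions.
least_freq = freq.dropna().index[-1]
least_freq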


'c'

In [65]:
"".join(ser.replace(' ', least_freq))

'dbccdebcabedcgade'
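
Since 'c' and 'g' both occur exactly once, the choice between them depends on how ties are ordered. A plain-Python sketch with `collections.Counter` makes the tie-break explicit: `min` returns the first least-frequent character in order of first appearance, which is 'c' for this string.

```python
from collections import Counter

my_str = 'dbc deb abed gade'

# Count every character except the spaces we want to replace.
counts = Counter(ch for ch in my_str if ch != ' ')

# min() keeps the first key with the smallest count; Counter preserves
# insertion order, so ties resolve to the earliest character seen.
least_freq = min(counts, key=counts.get)

filled = my_str.replace(' ', least_freq)
print(least_freq)  # c
print(filled)      # dbccdebcabedcgade
```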

## 30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays) after that having random numbers as values?

In [66]:
ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
ser

2000-01-01    7
2000-01-08    7
2000-01-15    1
2000-01-22    3
2000-01-29    3
2000-02-05    4
2000-02-12    1
2000-02-19    9
2000-02-26    6
2000-03-04    6
Freq: W-SAT, dtype: int32
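
To make the values reproducible and to confirm that every date is a Saturday, a seeded generator can be used. This sketch assumes NumPy 1.17 or later for `np.random.default_rng`; the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are the same on every run.
rng = np.random.default_rng(0)

# Ten weekly dates anchored on Saturdays, starting 2000-01-01.
dates = pd.date_range('2000-01-01', periods=10, freq='W-SAT')
ser = pd.Series(rng.integers(1, 10, 10), index=dates)

# dayofweek uses Monday=0, so Saturday is 5.
all_saturdays = bool((ser.index.dayofweek == 5).all())
print(ser)
print(all_saturdays)  # True
```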