# Welcome to this Kernel

* This kernel is a compilation of 24 exercises with solutions from this webpage:

https://www.machinelearningplus.com/python/101-pandas-exercises-python/


# Upvote if you found it useful


<a id='table_of_contents'></a>
# Table of contents

[1. How to create a series from a list, numpy array and dict?](#q1)

[2. How to combine many series to form a dataframe?](#q2)

[3. How to get the items of series A not present in series B?](#q3)

[4. How to get the items not common to both series A and series B?](#q4)

[5. How to get summary statistics of a series?](#q5)

[6. How to get frequency counts of unique items of a series?](#q6)

[7. How to convert a numpy array to a dataframe of given shape?](#q7)

[8. How to find the positions of numbers that are multiples of 3 from a series?](#q8)

[9. How to extract items at given positions from a series?](#q9)

[10. How to stack two series vertically and horizontally?](#q10)

[11. How to get the positions of items of series A in another series B?](#q11)

[12. How to compute difference of differences between consecutive numbers of a series?](#q12)

[13. How to convert a series of date-strings to a timeseries?](#q13)

[14. How to filter words that contain at least 2 vowels from a series?](#q14)

[15. How to replace missing spaces in a string with the least frequent character?](#q15)

[16. How to change column values when importing csv to a dataframe?](#q16)

[17. How to import only specified columns from a csv file?](#q17)

[18. How to check if a dataframe has any missing values?](#q18)

[19. How to replace missing values of multiple numeric columns with the mean?](#q19)

[20. How to change the order of columns of a dataframe?](#q20)

[21. How to filter every nth row in a dataframe?](#q21)

[22. How to get the last n rows of a dataframe with row sum > 100?](#q22)

[23. How to find and cap outliers from a series or dataframe column?](#q23)

[24. How to reverse the rows of a dataframe?](#q24)


In [None]:
# Allow several prints in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

# Pandas exercise

<a id = 'q1'></a>

**1. How to create a series from a list, numpy array and dict?**

Create a pandas series from each of the items below: a list, a numpy array, and a dictionary.

[Go back to the table of contents](#table_of_contents)

In [None]:
# Input
import numpy as np
a_list = list("abcdefg")
numpy_array = np.arange(1, 10)
dictionary = {"A":  0, "B":1, "C":2, "D":3, "E":5}

In [None]:
series1 = pd.Series(a_list)
print(series1)
series2 = pd.Series(numpy_array)
print(series2)
series3 = pd.Series(dictionary)
print(series3)

<a id = 'q2'></a>
**2. How to combine many series to form a dataframe?**

Combine ser1 and ser2 to form a dataframe.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

In [None]:
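# using the DataFrame constructor: ser2 is passed as the index, which only lines up
# because its values match ser1's default index; reset_index turns it back into a column
ser_df = pd.DataFrame(ser1, ser2).reset_index()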
ser_df.head()
# using pandas DataFrame with a dictionary, gives a specific name to the column
ser_df = pd.DataFrame({"col1":ser1, "col2":ser2})
ser_df.head(5)
# using pandas concat
ser_df = pd.concat([ser1, ser2], axis = 1)
ser_df.head()

<a id = 'q3'></a>
**3. How to get the items of series A not present in series B?**

From ser1, remove the items that are present in ser2.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [None]:
ser1[~ser1.isin(ser2)]

<a id = 'q4'></a>
**4. How to get the items not common to both series A and series B?**

Get all items of ser1 and ser2 not common to both.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [None]:
# using pandas
a_not_b = ser1[~ser1.isin(ser2)]
b_not_a = ser2[~ser2.isin(ser1)]
                          
# Series.append was removed in pandas 2.0, so use concat
pd.concat([a_not_b, b_not_a], ignore_index = True)

# using numpy union and intersection
ser_u = pd.Series(np.union1d(ser1, ser2))
ser_i = pd.Series(np.intersect1d(ser1, ser2))
ser_u[~ser_u.isin(ser_i)]
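
As a further sketch that is not part of the original exercise page, numpy's `np.setxor1d` computes the symmetric difference in one call. It returns sorted unique values and discards the original index, so it fits when only the values matter. The name `sym_diff` is an illustrative choice.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# sorted unique values that appear in exactly one of the two series
sym_diff = pd.Series(np.setxor1d(ser1, ser2))
print(sym_diff.tolist())
```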

<a id = 'q5'></a>
**5. How to get summary statistics of a series?**

Compute the minimum, 25th percentile, median, 75th percentile, and maximum of ser.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25))


In [None]:
# using pandas
ser.describe()
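
`describe()` also reports the count, mean and standard deviation. As a minimal sketch for getting only the five values the exercise asks for, `Series.quantile` can be called with the fractions 0, 0.25, 0.5, 0.75 and 1. The name `five_num` is an illustrative choice, not something from the original page.

```python
import numpy as np
import pandas as pd

state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25))

# minimum, 25th percentile, median, 75th percentile, maximum
five_num = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num)
```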

<a id = 'q6'></a>
**6. How to get frequency counts of unique items of a series?**

Calculate the frequency counts of each unique value of ser.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

In [None]:
ser.value_counts()

<a id = 'q7'></a>
**7. How to convert a numpy array to a dataframe of given shape?**

Reshape the series ser into a dataframe with 7 rows and 5 columns

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser = pd.Series(np.random.randint(1, 10, 35))
ser

In [None]:
# using numpy
pd.DataFrame(np.array(ser).reshape(7, 5))

# using only pandas
pd.DataFrame(ser.values.reshape(7, 5))

<a id = 'q8'></a>
**8. How to find the positions of numbers that are multiples of 3 from a series?**

Find the positions of numbers that are multiples of 3 from ser.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input

# keep the seeded generator and draw from it; a bare RandomState(100) call seeds nothing
state = np.random.RandomState(100)
ser = pd.Series(state.randint(1, 5, 10))
ser

In [None]:
# using where: non-multiples become NaN and are dropped, so the remaining index holds the positions
ser.where(lambda x: x%3 == 0).dropna()

# using numpy argwhere on the underlying boolean array, flattened to a 1-D array of positions
np.argwhere((ser%3 == 0).to_numpy()).ravel()

<a id = 'q9'></a>
**9. How to extract items at given positions from a series?**

From ser, extract the items at positions in list pos.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

In [None]:
# using iloc (position-based; loc would look up index labels instead)
ser.iloc[pos]

# using series take
ser.take(pos)

<a id = 'q10'></a>

**10. How to stack two series vertically and horizontally?**

Stack ser1 and ser2 vertically and horizontally (to form a dataframe).

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

In [None]:
# vertical, using pandas concat with axis = 0
# (Series.append was removed in pandas 2.0)
pd.concat([ser1, ser2], axis = 0)

# horizontal
pd.concat([ser1, ser2], axis = 1)

<a id = 'q11'></a>
**11. How to get the positions of items of series A in another series B?**

Get the positions of items of ser2 in ser1 as a list.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

In [None]:
# gets the positions, but ordered as in ser1 rather than as in ser2
list(ser1[ser1.isin(ser2)].index)

# using numpy where
[np.where(i == ser1)[0].tolist()[0] for i in ser2]

# using pandas Index and get location
[pd.Index(ser1).get_loc(i) for i in ser2]

<a id = 'q12'></a>
**12. How to compute difference of differences between consecutive numbers of a series?**

Compute the difference of differences between the consecutive numbers of ser.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Desired Output
# [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
# [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

In [None]:
# using pandas diff()
ser.diff(periods = 1).tolist()
ser.diff(periods = 1).diff(periods = 1).tolist()

<a id = 'q13'></a>

**13. How to convert a series of date-strings to a timeseries?**

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])


'''
Desired Output

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
'''



In [None]:
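# using pandas to_datetime
# (pandas >= 2.0 needs format="mixed" because the strings use different formats; omit it on older versions)
pd.to_datetime(ser, format="mixed")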

# using dateutil parse
from dateutil.parser import parse
ser.map(lambda x: parse(x))

<a id = 'q14'></a>

**14. How to filter words that contain at least 2 vowels from a series?**

From ser, extract words that contain at least 2 vowels.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

'''
Desired Output


0     Apple
1    Orange
4     Money
dtype: object
'''

In [None]:
# using nested loops
vowels = list("aeiou")
list_ = []
for w in ser:
    c = 0
    for l in list(w.lower()):
        if l in vowels:
            c += 1
    if c >= 2:
        print(w)
        list_.append(w)

ser[ser.isin(list_)]

# another solution using counter

from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
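
A shorter variant, offered as a sketch rather than as the original page's solution, counts vowels with the vectorized `Series.str.count` and a case-insensitive regular expression. The names `vowel_counts` and `result` are illustrative.

```python
import re

import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# count vowels per word, ignoring case, and keep words with at least two
vowel_counts = ser.str.count(r'[aeiou]', flags=re.IGNORECASE)
result = ser[vowel_counts >= 2]
print(result)
```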

<a id = 'q15'></a>

**15. How to replace missing spaces in a string with the least frequent character?**

Replace the spaces in my_str with the least frequent character.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
my_str = 'dbc deb abed ggade'

'''
Desired Output

'dbccdebcabedcggade'  # least frequent is 'c'
'''

In [None]:
# using Counter
from collections import Counter
my_str_ = my_str
Counter_ = Counter(list(my_str_.replace(" ", "")))
Counter_
minimum = min(Counter_, key = Counter_.get)

print(my_str.replace(" ", minimum))

# using pandas
ser = pd.Series(list(my_str.replace(" ", "")))
ser.value_counts()
minimum = list(ser.value_counts().index)[-1]
minimum
print(my_str.replace(" ", minimum))

<a id = 'q16'></a>

**16. How to change column values when importing csv to a dataframe?**

Import the boston housing dataset, but while importing change the 'medv' (median house value) column so that values below 25 become ‘Low’ and values of 25 or more become ‘High’.

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# first, parse the raw whitespace-delimited file and save it as a regular csv

names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()

# split each line on whitespace; a trailing empty line, if present, becomes an all-NaN last row
data_ = [row.split() for row in data.split("\n")]

df = pd.DataFrame(data_, columns=names)
df.head()
df.to_csv("housing_preprocessed.csv")
del df

In [None]:
# now import as usual and use converters to transform the values while reading
# skipfooter drops the trailing all-NaN row (it needs the python engine), and index_col marks the first column as the index
df = pd.read_csv("housing_preprocessed.csv", index_col = 0, skipfooter=1, engine="python", converters = {"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"})
df
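
The cell above depends on the Kaggle input path. To try `converters` without that file, here is a minimal sketch on an in-memory csv. The three-row sample and the helper name `label_medv` are hypothetical and only stand in for the housing data.

```python
import io

import pandas as pd

# hypothetical three-row sample standing in for the housing file
csv_text = "CRIM,MEDV\n0.01,24.0\n0.03,34.7\n0.02,25.0\n"

def label_medv(value):
    # converters receive each cell as a string, so cast before comparing
    return "High" if float(value) >= 25 else "Low"

sample = pd.read_csv(io.StringIO(csv_text), converters={"MEDV": label_medv})
print(sample)
```

Only the column named in `converters` is transformed; the other columns are parsed as usual.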

<a id = 'q17'></a>

**17. How to import only specified columns from a csv file?**

[Go back to the table of contents](#table_of_contents)

In [None]:
# input
        
# code that generates the housing_preprocessed.csv file
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()

# split each line on whitespace; a trailing empty line, if present, becomes an all-NaN last row
data_ = [row.split() for row in data.split("\n")]

df = pd.DataFrame(data_, columns=names)
df.to_csv("housing_preprocessed.csv")
del df

# use the /kaggle/input/boston-house-prices/housing_preprocessed.csv file
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
file = "housing_preprocessed.csv"
# using column positions (skipfooter needs the python engine)
df = pd.read_csv(file, usecols = [1, 2, 4], skipfooter=1, engine="python")
df.head()
# using column names
df = pd.read_csv(file, usecols = ["CRIM", "ZN", "CHAS"])
df.head()

<a id = 'q18'></a>

**18. How to check if a dataframe has any missing values?**

[Go back to the table of contents](#table_of_contents)


In [None]:
# input
df = pd.read_csv("../input/cars93/Cars93.csv")
df

In [None]:
# Solution 1
print("Our df has a total of {} null values".format(df.isnull().sum().sum()))
print()

# Solution 2
df.isnull().values.any()
print()

# Solution 3
# A more detailed one
def report_nulls(df):
    '''
    Show a fast report of the DF.
    '''
    rows = df.shape[0]
    columns = df.shape[1]
    null_cols = 0
    list_of_nulls_cols = []
    for col in list(df.columns):
        null_values_rows = df[col].isnull().sum()
        null_rows_pcn = round(((null_values_rows)/rows)*100, 2)
        col_type = df[col].dtype
        if null_values_rows > 0:
            print("The column {} has {} null values. It is {}% of total rows.".format(col, null_values_rows, null_rows_pcn))
            print("The column {} is of type {}.\n".format(col, col_type))
            null_cols += 1
            list_of_nulls_cols.append(col)
    null_cols_pcn = round((null_cols/columns)*100, 2)
    print("The DataFrame has {} columns with null values. It is {}% of total columns.".format(null_cols, null_cols_pcn))
    return list_of_nulls_cols

report_nulls(df)

<a id = 'q19'></a>

**19. How to replace missing values of multiple numeric columns with the mean?**

Replace missing values in the Luggage.room column with the column mean.

[Go back to the table of contents](#table_of_contents)


In [None]:
# input
df = pd.read_csv("../input/cars93/Cars93.csv")

In [None]:
# Solution 1
beg_null = df.isnull().sum().sum()
print(beg_null)
# note that the columns are selected with a list, so the same line works for several columns
df[["Luggage.room"]] = df[["Luggage.room"]].apply(lambda x: x.fillna(x.mean()))
end_null = df.isnull().sum().sum()
print(end_null)

print("We have got rid of {} null values, filling them with the mean.".format(beg_null - end_null))
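
The exercise title mentions multiple numeric columns. As a hedged sketch on a toy frame (the column names `price`, `weight` and `maker` are hypothetical, not Cars93 columns), all numeric columns can be filled with their own means in one step:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame; the column names are illustrative, not from Cars93
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "weight": [1.0, 2.0, np.nan],
    "maker": ["a", None, "c"],
})

# pick the numeric columns and fill each one with its own mean
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
print(toy)
```

Non-numeric columns such as `maker` are left untouched, so their missing values remain.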

<a id = 'q20'></a>

**20. How to change the order of columns of a dataframe?**

Actually 3 questions.

1. In df, interchange columns 'a' and 'c'.

2. Create a generic function to interchange two columns, without hardcoding column names.

3. Sort the columns in reverse alphabetical order, that is column 'e' first through column 'a' last.

[Go back to the table of contents](#table_of_contents)


In [None]:
# input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

In [None]:
# Solution to question 1
# selecting with a list of column names returns a reordered copy; it does NOT change df in place
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[["c", "b", "a", "d", "e"]]
df

# reassigning the result makes the new order stick
df = df[["c", "b", "a", "d", "e"]]
df

# Solution to question 2
def change_cols(df, col1, col2):
    df_columns = df.columns.to_list()
    index1 = df_columns.index(col1)
    index2 = df_columns.index(col2)
    # swapping values: each position receives the other column's name
    df_columns[index1], df_columns[index2] = col2, col1
    
    return df[df_columns]


df = change_cols(df, "b", "e")
df
    

# Solution to question 3
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
col_list = list(df.columns)
col_list_reversed = col_list[::-1]
col_list
col_list_reversed
# using the trick from solution 1
df = df[col_list_reversed]
df


print("Solution from the website")
print("-------------------------")
# Other solutions from the website

# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution Q1
df[list('cbade')]

# Solution Q2 - No hard coding
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]

df1 = switch_columns(df, 'a', 'c')

# Solution Q3
# reverse=True gives 'e' first through 'a' last
df[sorted(df.columns, reverse=True)]
# or
df.sort_index(axis=1, ascending=False, inplace=True)

<a id = 'q21'></a>

**21. How to filter every nth row in a dataframe?**

From df, filter the 'Manufacturer', 'Model' and 'Type' for every 20th row starting from 1st (row 0).

[Go back to the table of contents](#table_of_contents)


In [None]:
# input
df = pd.read_csv("../input/cars93/Cars93.csv")
df

In [None]:
# First let's import only the columns we need
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])

# Solution 1
# Using normal python slicing
df[::20]

df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])

# Solution 2
# Using iloc
df.iloc[::20, :][['Manufacturer', 'Model', 'Type']]



<a id = 'q22'></a>

**22. How to get the last n rows of a dataframe with row sum > 100?**

Get the last two rows of df whose row sum is greater than 100.

[Go back to the table of contents](#table_of_contents)





In [None]:
# input
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
df1 = df.copy(deep = True)

In [None]:
# Solution 1
df["sum"] = df.sum(axis = 1)
df

print("The indices of the last two rows with a sum greater than 100 are {}".format((df[df["sum"] > 100].index).to_list()[-2:]))

# Solution 2 using numpy
rowsums = df1.apply(np.sum, axis=1)

# last two rows with row sum greater than 100
last_two_rows = df1.iloc[np.where(rowsums > 100)[0][-2:], :]
last_two_rows
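
A more compact variant, sketched here as an alternative to the two solutions above, filters with a boolean mask and takes `tail(2)`. The seed value 0 is an arbitrary assumption added for reproducibility.

```python
import numpy as np
import pandas as pd

state = np.random.RandomState(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(state.randint(10, 40, 60).reshape(-1, 4))

# keep rows whose sum exceeds 100, then take the last two of them
last_two = df[df.sum(axis=1) > 100].tail(2)
print(last_two)
```

If fewer than two rows qualify, `tail(2)` simply returns what is available.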


<a id = 'q23'></a>

**23. How to find and cap outliers from a series or dataframe column?**

Replace all values of ser below the 5th percentile and above the 95th percentile with the 5th and 95th percentile values respectively.

[Go back to the table of contents](#table_of_contents)





In [None]:
# input
ser = pd.Series(np.logspace(-2, 2, 30))
ser1 = ser.copy(deep = True)
ser2 = ser.copy(deep = True)

In [None]:
# Solution 1
# get the quantile values
quantiles = np.quantile(ser, [0.05, 0.95])
ser

# use numpy to find the positions below the 5th or above the 95th percentile and overwrite them
# (np.where returns a tuple, so take its first element)
ser.iloc[np.where(ser < quantiles[0])[0]] = quantiles[0]
ser.iloc[np.where(ser > quantiles[1])[0]] = quantiles[1]
    
# or we can just do
ser1[ser1 < quantiles[0]] = quantiles[0]
ser1[ser1 > quantiles[1]] = quantiles[1]

ser1

# Solution from the webpage
def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
    ser[ser < low] = low
    ser[ser > high] = high
    return ser

capped_ser = cap_outliers(ser2, .05, .95)
ser2
capped_ser
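
The same capping can be written with `Series.clip`, which returns a new series and leaves the input unchanged. This is a sketch of an alternative, not the solution from the webpage; the name `capped` is illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.logspace(-2, 2, 30))

# 5th and 95th percentile values
low, high = ser.quantile([0.05, 0.95])

# values below low are raised to low, values above high are lowered to high
capped = ser.clip(lower=low, upper=high)
print(capped)
```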

<a id = 'q24'></a>

**24. How to reverse the rows of a dataframe?**

Reverse all the rows of dataframe df.

[Go back to the table of contents](#table_of_contents)



In [None]:
# input
df = pd.DataFrame(np.arange(25).reshape(5, -1))

In [None]:
# Solution 1
df
df.iloc[df.index.to_list()[::-1]]

# Solutions from the webpage
# Solution 2
df.iloc[::-1, :]

# Solution 3
print(df.loc[df.index[::-1], :])
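
All three solutions keep the original index labels, so the result is indexed 4 down to 0. If a fresh 0-based index is wanted, a minimal sketch (the name `reversed_df` is illustrative) is to chain `reset_index(drop=True)`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, -1))

# reverse the row order, then renumber the index from 0
reversed_df = df.iloc[::-1].reset_index(drop=True)
print(reversed_df)
```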

### The End