
#### **Pandas Series.str.contains()**

#### **Syntax: Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)**

#### **Parameters:**

 - **pat :** Character sequence or regular expression.

 - **case :** If **True**, the match is **case sensitive.**

 - **flags :** Flags to pass through to the re module, e.g. re.IGNORECASE.

 - **na :** Fill value for **missing values.**

 - **regex :** If True, treats pat as a regular expression; if False, treats pat as a literal string.

 - **Returns :** Series or Index of boolean values.
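
The effect of **na** and **regex** is easier to see on a tiny example. The sketch below uses a hypothetical Series named `codes` (not part of this notebook's data) with one missing value and one string that contains a literal dot.

```python
import pandas as pd

# Hypothetical data: one missing value and one string containing a literal dot.
codes = pd.Series(['a.b', 'axb', None])

# regex=True (default): '.' is a wildcard, so the pattern matches both strings.
as_regex = codes.str.contains('a.b', regex=True, na=False)

# regex=False: the pattern is a literal substring, so only 'a.b' matches.
as_literal = codes.str.contains('a.b', regex=False, na=False)

# na=False fills the result for the missing value with False in both cases.
print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
```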

In [1]:
import pandas as pd
import numpy as np

In [2]:
# importing re for regular expressions
import re
 
# Creating the Series
sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich'])
 
# Creating the index
idx = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5']
 
# set the index
sr.index = idx
 
# Print the series
print(sr)

City 1    New_York
City 2      Lisbon
City 3       Tokyo
City 4       Paris
City 5      Munich
dtype: object


### **1) Letter followed by a lowercase letter**

In [3]:
# check whether each value has the letter 'i' followed by any lowercase letter
result = sr.str.contains(pat = 'i[a-z]', regex = True)
 
# print the result
print(result)

City 1    False
City 2     True
City 3    False
City 4     True
City 5     True
dtype: bool
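
For intuition, the same result can be reproduced one element at a time with `re.search`. This is only a sketch to show what the vectorised call is doing; the name `manual` is introduced here for illustration, and the pandas version is the one to use in practice.

```python
import re
import pandas as pd

sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich'])
pat = 'i[a-z]'

# Vectorised pandas version.
vectorised = sr.str.contains(pat, regex=True)

# Element-by-element version using re.search (slower, shown only for intuition).
manual = sr.map(lambda s: bool(re.search(pat, s)))

print(vectorised.tolist())                     # [False, True, False, True, True]
print(vectorised.tolist() == manual.tolist())  # True
```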


### **2) Filter rows by whether a column contains a string**

In [4]:
# create a dataframe
raw_data = {'name': ['Willard Morris', 'Al Jennings'],
            'age': [20, 19],
            'favorite_color': ['blue', 'red'],
            'grade': [88, 92]}

df = pd.DataFrame(raw_data, index = ['Willard Morris', 'Al Jennings'])
df

                          name  age  favorite_color  grade
Willard Morris  Willard Morris   20            blue     88
Al Jennings        Al Jennings   19             red     92


#### **2.1) Rows where the column contains the string**

In [5]:
# keep only the rows whose name contains 'Morris' (missing names count as no match)
new_df = df[df['name'].str.contains('Morris', na=False)]
new_df.head()

                          name  age  favorite_color  grade
Willard Morris  Willard Morris   20            blue     88


#### **2.2) Rows where the column does not contain the string**

In [6]:
# keep only the rows whose name does not contain 'Morris'
new_df2 = df[~df['name'].str.contains('Morris', na=False)]
new_df2.head()

                    name  age  favorite_color  grade
Al Jennings  Al Jennings   19             red     92
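
The `na=False` argument matters once a column has missing values. The sketch below uses a hypothetical frame named `people` with one missing name (not part of this notebook's data) and shows that the missing row ends up in the "does not contain" group.

```python
import pandas as pd

# Hypothetical frame with one missing name.
people = pd.DataFrame({'name': ['Willard Morris', 'Al Jennings', None],
                       'age': [20, 19, 31]})

# na=False turns the missing name into False, so the mask is purely boolean.
mask = people['name'].str.contains('Morris', na=False)

contains = people[mask]        # rows whose name contains 'Morris'
not_contains = people[~mask]   # everything else, including the missing name

print(contains['age'].tolist())      # [20]
print(not_contains['age'].tolist())  # [19, 31]
```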


### **3) case = True / False**

In [7]:
my_dict = {'id':[1,2,3,4,5,6,7],
           'name':['$John','Ma51','Arnold1','Krish0','Roni','Krish','Max'],
           'class':['Four','Three','#Three','Four','7Four','Four,','%Three'],
           'mark':[75,85,55,60,60,60,85]
           }

df1 = pd.DataFrame(data = my_dict)
print(df1)

   id     name   class  mark
0   1    $John    Four    75
1   2     Ma51   Three    85
2   3  Arnold1  #Three    55
3   4   Krish0    Four    60
4   5     Roni   7Four    60
5   6    Krish   Four,    60
6   7      Max  %Three    85


In [8]:
df2 = df1.copy()
df3 = df1.copy()

- We use **contains()** to keep only the rows that have **ar** in the **name column**.

- The option **case=False** makes the match **case insensitive**, so the **Ar** in **Arnold1** matches.

- Set **case=True** (the default) for a **case sensitive** match.

In [9]:
ar = df1[df1['name'].str.contains('ar', case=False)]
ar

   id     name   class  mark
2   3  Arnold1  #Three    55
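
To make the difference concrete, here is a small side-by-side sketch on the same names. The Series `names` is rebuilt here so the block runs on its own.

```python
import pandas as pd

names = pd.Series(['$John', 'Ma51', 'Arnold1', 'Krish0', 'Roni', 'Krish', 'Max'])

# case=True (default): lowercase 'ar' does not match the capital 'Ar' in 'Arnold1'.
sensitive = names[names.str.contains('ar', case=True)]

# case=False: 'Ar' matches.
insensitive = names[names.str.contains('ar', case=False)]

print(sensitive.tolist())    # []
print(insensitive.tolist())  # ['Arnold1']
```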


### **4) regex=True | False**

- **regex=True** (the default) treats the pattern as a regular expression; **regex=False** treats it as a literal string.

- Here we collect the rows where the **name** column **starts with A or K**.

In [10]:
AK = df1[df1['name'].str.contains('^[AK]', case=True, regex=True)]
AK

   id     name   class  mark
2   3  Arnold1  #Three    55
3   4   Krish0    Four    60
5   6    Krish   Four,    60
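
The regex option matters most when the text being searched for contains a regex metacharacter. As a sketch, searching the same names for a dollar sign behaves very differently with and without regex.

```python
import pandas as pd

names = pd.Series(['$John', 'Ma51', 'Arnold1', 'Krish0', 'Roni', 'Krish', 'Max'])

# As a regex, '$' is the end-of-string anchor, so it "matches" every name.
anchor = names.str.contains('$', regex=True)

# As a literal, '$' only matches names that really contain a dollar sign.
literal = names[names.str.contains('$', regex=False)]

# Escaping the metacharacter gives the same literal match with regex=True.
escaped = names[names.str.contains(r'\$', regex=True)]

print(anchor.all())      # True
print(literal.tolist())  # ['$John']
print(escaped.tolist())  # ['$John']
```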


#### **a) Name column ending with h**

In [11]:
h = df1[df1['name'].str.contains('h$', case=True, regex=True)]
h

   id   name  class  mark
5   6  Krish  Four,    60


#### **b) Name column ending with h or n**

In [12]:
HN = df1[df1['name'].str.contains('[hn]$', case=True, regex=True)]
HN

   id   name  class  mark
0   1  $John   Four    75
5   6  Krish  Four,    60


#### **c) Name column not having ar**

In [13]:
AR = df1[~df1['name'].str.contains('ar', case=False)]
AR

   id    name   class  mark
0   1   $John    Four    75
1   2    Ma51   Three    85
3   4  Krish0    Four    60
4   5    Roni   7Four    60
5   6   Krish   Four,    60
6   7     Max  %Three    85


In [14]:
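# Alternative: a negative lookahead matches only names with no 'ar' anywhere.
# Note: case=True here, so 'Arnold1' (capital 'Ar') still matches and is kept.
# The non-capturing group (?:...) avoids the pandas "match groups" UserWarning.
AR1 = df1[df1['name'].str.contains('^(?:(?!ar).)*$', case=True, regex=True)]
AR1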


   id     name   class  mark
0   1    $John    Four    75
1   2     Ma51   Three    85
2   3  Arnold1  #Three    55
3   4   Krish0    Four    60
4   5     Roni   7Four    60
5   6    Krish   Four,    60
6   7      Max  %Three    85
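
The lookahead version keeps every row because it is case sensitive, while the `~` version was case insensitive. As a sketch, passing `case=False` to the lookahead pattern should make the two approaches agree.

```python
import pandas as pd

names = pd.Series(['$John', 'Ma51', 'Arnold1', 'Krish0', 'Roni', 'Krish', 'Max'])

# Negating a case-insensitive contains().
negated = ~names.str.contains('ar', case=False)

# Negative lookahead with the same case-insensitivity. (?:...) is a
# non-capturing group, so pandas has no match groups to warn about.
lookahead = names.str.contains(r'^(?:(?!ar).)*$', case=False, regex=True)

print(names[negated].tolist())
# ['$John', 'Ma51', 'Krish0', 'Roni', 'Krish', 'Max']
print(negated.tolist() == lookahead.tolist())  # True
```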


#### **d) Display all rows where the class column has special characters**

In [15]:
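# Inside [...], '+-/' is a character range ('+' through '/'), which also covers ',' and '.';
# that is why 'Four,' matches. Write '\-' to match a literal hyphen only.
print(df1[df1['class'].str.contains(r'[@#&$%+-/*]')])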

   id     name   class  mark
2   3  Arnold1  #Three    55
5   6    Krish   Four,    60
6   7      Max  %Three    85
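
The row `Four,` appears in this result because of how the hyphen is read inside the character class. The sketch below compares the pattern as written with a variant where the hyphen is escaped; the variable names are introduced here for illustration.

```python
import pandas as pd

classes = pd.Series(['Four', 'Three', '#Three', 'Four', '7Four', 'Four,', '%Three'])

# Unescaped: '+-/' is a range covering '+', ',', '-', '.', '/'.
with_range = classes[classes.str.contains(r'[@#&$%+-/*]')]

# Hyphen escaped: '-' is a literal, so the comma in 'Four,' no longer matches.
with_literal_hyphen = classes[classes.str.contains(r'[@#&$%+\-/*]')]

print(with_range.tolist())           # ['#Three', 'Four,', '%Three']
print(with_literal_hyphen.tolist())  # ['#Three', '%Three']
```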


#### **e) Display all rows where the name column contains a digit**

In [16]:
print(df1[df1['name'].str.contains('\\d', regex=True)])

   id     name   class  mark
1   2     Ma51   Three    85
2   3  Arnold1  #Three    55
3   4   Krish0    Four    60


#### **g) Display all rows where name contains 0**

In [17]:
print(df1[df1['name'].str.contains('0')])

   id    name class  mark
3   4  Krish0  Four    60


#### **h) Display all rows where name contains 0 or the class column has special characters (OR combination)**

In [18]:
print(df1[df1['class'].str.contains(r'[@#&$%+-/*]') | df1['name'].str.contains('0')])

   id     name   class  mark
2   3  Arnold1  #Three    55
3   4   Krish0    Four    60
5   6    Krish   Four,    60
6   7      Max  %Three    85
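
Boolean masks from **contains()** can be combined with `&` (AND) in the same way as with `|` (OR). A minimal sketch, rebuilding the same data so the block runs on its own; the mask names `has_digit` and `special_class` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['$John', 'Ma51', 'Arnold1', 'Krish0', 'Roni', 'Krish', 'Max'],
    'class': ['Four', 'Three', '#Three', 'Four', '7Four', 'Four,', '%Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
})

has_digit = df1['name'].str.contains(r'\d')
special_class = df1['class'].str.contains(r'[@#&$%+-/*]')

# & keeps rows where both conditions hold; | keeps rows where either holds.
both = df1[has_digit & special_class]
either = df1[has_digit | special_class]

print(both['id'].tolist())    # [3]
print(either['id'].tolist())  # [2, 3, 4, 6, 7]
```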


#### **i) Deleting the rows matching the condition**

- In all the cases above we displayed the matching rows. We can use **drop()** to **delete the matching rows and return the remaining rows**.

- Note that **drop() returns a new DataFrame and does not change the original DataFrame** (unless **inplace=True** is passed).

- **Deleting the rows having 0 in the name column.**

In [19]:
# Output ( id 4 is deleted )
drop = df1.drop(df1[df1['name'].str.contains('0')].index)
print(drop)

   id     name   class  mark
0   1    $John    Four    75
1   2     Ma51   Three    85
2   3  Arnold1  #Three    55
4   5     Roni   7Four    60
5   6    Krish   Four,    60
6   7      Max  %Three    85
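
Dropping by index gives the same rows as filtering with the negated mask. The sketch below checks this on the same data and confirms that the original DataFrame keeps all of its rows; the names `dropped` and `filtered` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['$John', 'Ma51', 'Arnold1', 'Krish0', 'Roni', 'Krish', 'Max'],
    'class': ['Four', 'Three', '#Three', 'Four', '7Four', 'Four,', '%Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
})

mask = df1['name'].str.contains('0')

dropped = df1.drop(df1[mask].index)  # approach used above
filtered = df1[~mask]                # boolean filter with the negated mask

print(dropped['id'].tolist())    # [1, 2, 3, 5, 6, 7]
print(dropped.equals(filtered))  # True
print(len(df1))                  # 7 (the original is unchanged)
```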


#### **j) Delete the rows having special chars in class column**

In [20]:
# Output ( id 3, 6 and 7 are deleted )
SC = df2.drop(df2[df2['class'].str.contains(r'[@#&$%+-/*]')].index)
print(SC)

   id    name  class  mark
0   1   $John   Four    75
1   2    Ma51  Three    85
3   4  Krish0   Four    60
4   5    Roni  7Four    60


#### **k) Delete the rows having special characters in class or name columns**

In [22]:
SCCN = df3.drop(df3[df3['class'].str.contains(r'[@#&$%+-/*]') | df3['name'].str.contains(r'[@#&$%+-/*]')].index)
print(SCCN)

   id    name  class  mark
1   2    Ma51  Three    85
3   4  Krish0   Four    60
4   5    Roni  7Four    60
