# Introduction 

Hi folks,
We are starting the full DataFlow journey, which covers importing data, cleaning data, merging, joining, and concatenating data, GroupBy operations, reshaping and pivoting DataFrames, data preparation and feature creation, and more.

### Lecture Agenda
In this lecture we look at the options pandas offers for string operations.

#### Technologies Used
Python, Pandas


### Data Used
Data Professionals Salary - 2022
Salaries of Data Scientists, ML Engineers, Data Analysts, Data Engineers in 2022

### Getting Started

We will walk you through the string-handling part of the pandas library and show you its most useful string processing functions. You will learn how to use:
* upper()
* lower()
* isupper()
* islower()
* isnumeric()
* replace()
* split()
* contains()
* find()
* findall()



[Official Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/analytics-industry-salaries-2022-india/info.txt
/kaggle/input/analytics-industry-salaries-2022-india/Salary Dataset.csv


In [2]:
df=pd.read_csv("../input/analytics-industry-salaries-2022-india/Salary Dataset.csv")

In [3]:
df

| | Company Name | Job Title | Salaries Reported | Location | Salary |
|---|---|---|---|---|---|
| 0 | Mu Sigma | Data Scientist | 105.0 | Bangalore | ₹6,48,573/yr |
| 1 | IBM | Data Scientist | 95.0 | Bangalore | ₹11,91,950/yr |
| 2 | Tata Consultancy Services | Data Scientist | 66.0 | Bangalore | ₹8,36,874/yr |
| 3 | Impact Analytics | Data Scientist | 40.0 | Bangalore | ₹6,69,578/yr |
| 4 | Accenture | Data Scientist | 32.0 | Bangalore | ₹9,44,110/yr |
| ... | ... | ... | ... | ... | ... |
| 4339 | TaiyōAI | Machine Learning Scientist | 1.0 | Mumbai | ₹5,180/mo |
| 4340 | Decimal Point Analytics | Machine Learning Developer | 1.0 | Mumbai | ₹7,51,286/yr |
| 4341 | MyWays | Machine Learning Developer | 1.0 | Mumbai | ₹4,10,952/yr |
| 4342 | Market Pulse Technologies | Software Engineer - Machine Learning | 1.0 | Mumbai | ₹16,12,324/yr |


In [4]:
df["Company Name"].value_counts()

Tata Consultancy Services     41
Amazon                        32
Accenture                     30
Google                        27
Fresher                       26
                              ..
URS Technologies Solutions     1
Aniket Sonawane                1
Brahman bhetun                 1
Airavaat Car Rentals           1
Market Pulse Technologies      1
Name: Company Name, Length: 2529, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4344 entries, 0 to 4343
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Company Name       4341 non-null   object 
 1   Job Title          4344 non-null   object 
 2   Salaries Reported  4342 non-null   float64
 3   Location           4344 non-null   object 
 4   Salary             4344 non-null   object 
dtypes: float64(1), object(4)
memory usage: 169.8+ KB


In [6]:
df.Salary

0        ₹6,48,573/yr
1       ₹11,91,950/yr
2        ₹8,36,874/yr
3        ₹6,69,578/yr
4        ₹9,44,110/yr
            ...      
4339        ₹5,180/mo
4340     ₹7,51,286/yr
4341     ₹4,10,952/yr
4342    ₹16,12,324/yr
4343     ₹9,39,843/yr
Name: Salary, Length: 4344, dtype: object

In [7]:
df.columns

Index(['Company Name', 'Job Title', 'Salaries Reported', 'Location', 'Salary'], dtype='object')

# 1. upper()


The first function converts every letter in a string to upper case. We can apply it to the Company Name column using the following code.

In [8]:
df["Company Name"].str.upper()

0                        MU SIGMA
1                             IBM
2       TATA CONSULTANCY SERVICES
3                IMPACT ANALYTICS
4                       ACCENTURE
                  ...            
4339                      TAIYŌAI
4340      DECIMAL POINT ANALYTICS
4341                       MYWAYS
4342    MARKET PULSE TECHNOLOGIES
4343                      VPHRASE
Name: Company Name, Length: 4344, dtype: object

# 2. lower()


The lower() function works like upper() but does the opposite: it converts every character in a string to lower case. Here are the results of calling it on the Job Title column.

In [9]:
df["Job Title"].str.lower()

0                             data scientist
1                             data scientist
2                             data scientist
3                             data scientist
4                             data scientist
                        ...                 
4339              machine learning scientist
4340              machine learning developer
4341              machine learning developer
4342    software engineer - machine learning
4343               machine learning engineer
Name: Job Title, Length: 4344, dtype: object
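A common reason to change case is to make differently spelled entries comparable. The sketch below uses a small, hypothetical Series (not rows from the salary dataset) to show that lower() and upper() collapse case variants into one value and leave missing entries missing.

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
titles = pd.Series(["Data Scientist", "data scientist", "DATA SCIENTIST", None])

# Lower-casing makes the three spellings identical; the missing value stays missing.
normalized = titles.str.lower()
print(normalized.value_counts())

# upper() behaves the same way in the other direction.
shouted = titles.str.upper()
print(shouted)
```

Normalizing case before value_counts() or a merge is one way to avoid counting the same title several times.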

# 3. isupper()


This function is called the same way as upper() or lower(), via '.str' on the column. It checks, for every string entry in the column, whether all of its cased characters are upper case. Let's call it on the Job Title column.

In [10]:
df["Job Title"].str.isupper()

0       False
1       False
2       False
3       False
4       False
        ...  
4339    False
4340    False
4341    False
4342    False
4343    False
Name: Job Title, Length: 4344, dtype: bool

# 4. islower()


This function works the same way as isupper() but checks the opposite property: whether all cased characters are lower case.

In [11]:
df["Job Title"].str.islower()

0       False
1       False
2       False
3       False
4       False
        ...  
4339    False
4340    False
4341    False
4342    False
4343    False
Name: Job Title, Length: 4344, dtype: bool
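Both checks only look at cased characters, which can be surprising for mixed strings. The following sketch, built on a few illustrative values chosen for this example, puts isupper() and islower() side by side: digits and symbols are ignored, and a string with no letters at all fails both checks.

```python
import pandas as pd

# Illustrative values chosen to cover the interesting cases.
values = pd.Series(["IBM", "ibm", "Ibm", "₹5,180/mo", "123"])

checks = pd.DataFrame({
    "value": values,
    "isupper": values.str.isupper(),  # True only if every cased character is upper case
    "islower": values.str.islower(),  # True only if every cased character is lower case
})
print(checks)
```

Note that "₹5,180/mo" counts as lower case because its only letters, "mo", are lower case.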

# 5. isnumeric()


This function checks whether every character in the string is numeric (digits and other Unicode numerals). All of them have to be numeric for isnumeric() to return True, so letters, separators, and currency symbols make it return False.

In [12]:
df["Job Title"].str.isnumeric()

0       False
1       False
2       False
3       False
4       False
        ...  
4339    False
4340    False
4341    False
4342    False
4343    False
Name: Job Title, Length: 4344, dtype: bool

In [13]:
df["Salary"].str.isnumeric()

0       False
1       False
2       False
3       False
4       False
        ...  
4339    False
4340    False
4341    False
4342    False
4343    False
Name: Salary, Length: 4344, dtype: bool
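To see where isnumeric() draws the line, here is a minimal sketch on a handful of made-up strings (assumed for illustration, not read from the dataset). Only the string made up entirely of digits passes; commas, a decimal point, a currency symbol, or an empty string all fail.

```python
import pandas as pd

# Made-up strings that look numeric to a human reader.
samples = pd.Series(["2022", "6,48,573", "105.0", "₹5180", ""])

# Every character must be numeric, and the string must not be empty.
is_num = samples.str.isnumeric()
print(pd.DataFrame({"value": samples, "isnumeric": is_num}))
```

This is why a formatted salary string has to be cleaned before it can be treated as a number.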

In [14]:
# df["Salaries Reported"].str.isnumeric()  # left commented out: a float64 column has no .str accessor, so this raises AttributeError


# 6. replace()


Another very useful function is replace(). It replaces part of a string with another string. Let's use it on the Salary column to strip the '₹' currency symbol.



In [16]:
df["Salary"].str.replace("₹","")

0        6,48,573/yr
1       11,91,950/yr
2        8,36,874/yr
3        6,69,578/yr
4        9,44,110/yr
            ...     
4339        5,180/mo
4340     7,51,286/yr
4341     4,10,952/yr
4342    16,12,324/yr
4343     9,39,843/yr
Name: Salary, Length: 4344, dtype: object
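replace() also accepts regular expressions, which helps when several characters need to go at once. The sketch below is one possible cleaning sequence, assuming salaries always follow the "₹amount/period" shape seen above; the three values are copied from the displayed rows and the variable names are illustrative.

```python
import pandas as pd

# Three values copied from the output above.
salary = pd.Series(["₹6,48,573/yr", "₹11,91,950/yr", "₹5,180/mo"])

# Literal replacement: regex=False treats the pattern as plain text.
no_symbol = salary.str.replace("₹", "", regex=False)

# Regex replacement: drop the currency symbol and the commas in one pass.
cleaned = salary.str.replace(r"[₹,]", "", regex=True)

# Strip the pay period ("/yr", "/mo"), then convert the amount to an integer.
amount = cleaned.str.replace(r"/.*$", "", regex=True).astype(int)
print(amount)
```

Passing regex explicitly avoids depending on the default, which has changed between pandas versions.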

# 7. split()


The split() function splits a string on a given separator. It is very useful if you have a sentence and want a list of individual words. Called without arguments, it splits on whitespace; you can also pass a separator such as 'a'.

In [17]:
df["Company Name"].str.split()

0                         [Mu, Sigma]
1                               [IBM]
2       [Tata, Consultancy, Services]
3                 [Impact, Analytics]
4                         [Accenture]
                    ...              
4339                        [TaiyōAI]
4340      [Decimal, Point, Analytics]
4341                         [MyWays]
4342    [Market, Pulse, Technologies]
4343                        [vPhrase]
Name: Company Name, Length: 4344, dtype: object

In [18]:
df["Company Name"].str.split("a")

0                          [Mu Sigm, ]
1                                [IBM]
2       [T, t,  Consult, ncy Services]
3                 [Imp, ct An, lytics]
4                          [Accenture]
                     ...              
4339                        [T, iyōAI]
4340       [Decim, l Point An, lytics]
4341                         [MyW, ys]
4342      [M, rket Pulse Technologies]
4343                        [vPhr, se]
Name: Company Name, Length: 4344, dtype: object
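Lists inside a column are awkward to work with, so split() is often combined with expand=True or with '.str' indexing. The following sketch shows both on a few values copied from the outputs above; the column names "amount" and "period" are made up for this example.

```python
import pandas as pd

salary = pd.Series(["₹6,48,573/yr", "₹5,180/mo"])

# expand=True returns a DataFrame with one column per piece instead of lists.
parts = salary.str.split("/", expand=True)
parts.columns = ["amount", "period"]  # hypothetical column names
print(parts)

company = pd.Series(["Tata Consultancy Services", "IBM"])

# .str[0] picks the first element of each list produced by split().
first_word = company.str.split().str[0]

# n=1 limits the number of splits, so at most two pieces come back.
head_tail = company.str.split(n=1)
print(first_word)
print(head_tail)
```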

# 8. contains()


The contains() function checks whether each string contains a particular substring (or regular-expression match). Its call looks similar to replace(), but instead of changing the string it returns the boolean value True or False.

In [19]:
df["Company Name"].str.contains("a")

0        True
1       False
2        True
3        True
4       False
        ...  
4339     True
4340     True
4341     True
4342     True
4343     True
Name: Company Name, Length: 4344, dtype: object
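The result above has dtype object rather than bool, most likely because Company Name has a few missing values, and contains() leaves those as missing. A minimal sketch, using a small hypothetical Series with one missing entry, shows how the na and case parameters turn the result into a clean boolean mask that can filter rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing company name.
company = pd.Series(["Mu Sigma", "IBM", np.nan, "Accenture"])

# Without na=..., the missing entry stays missing in the result.
raw = company.str.contains("a")

# na=False treats missing entries as "no match", giving a pure bool mask.
mask = company.str.contains("a", na=False)

# case=False ignores case, so the capital "A" in "Accenture" now matches.
any_case = company.str.contains("a", case=False, na=False)

# A boolean mask can be used directly to filter the Series.
print(company[mask])
```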

# 9. find()


find() is another function that comes in handy when cleaning string data. It returns the lowest index at which a given substring is found in each string, or -1 if the substring is absent.


In [21]:
df["Job Title"].str.find("Data")

0       0
1       0
2       0
3       0
4       0
       ..
4339   -1
4340   -1
4341   -1
4342   -1
4343   -1
Name: Job Title, Length: 4344, dtype: int64
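Because -1 means "not found", the result of find() doubles as a presence test. The sketch below uses three illustrative job titles (assumed for this example) to show a match at the start, a match in the middle, and no match.

```python
import pandas as pd

# Illustrative titles: match at index 0, match further in, and no match.
titles = pd.Series(["Data Scientist", "Senior Data Engineer", "Machine Learning Engineer"])

# Index of the first occurrence of "Data", or -1 when it is absent.
position = titles.str.find("Data")
print(position)

# Comparing against -1 converts the positions into a boolean mask.
has_data = position != -1
print(titles[has_data])
```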

# 10. findall()


findall(), like find(), searches each string for a pattern, but instead of a single index it returns a list of all matching substrings (an empty list if there are none).

In [22]:
df["Job Title"].str.findall("Data")

0       [Data]
1       [Data]
2       [Data]
3       [Data]
4       [Data]
         ...  
4339        []
4340        []
4341        []
4342        []
4343        []
Name: Job Title, Length: 4344, dtype: object
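findall() treats its argument as a regular expression, so it can pull out every occurrence of a pattern rather than a fixed word. As a rough sketch, the example below extracts the digit groups from two salary strings copied from the outputs above and counts repeated words in a hypothetical job title.

```python
import pandas as pd

salary = pd.Series(["₹6,48,573/yr", "₹5,180/mo"])

# \d+ matches each run of digits, so every comma-separated group is returned.
digit_groups = salary.str.findall(r"\d+")
print(digit_groups)

# Hypothetical titles: the first one mentions "Data" twice.
titles = pd.Series(["Data Scientist - Data Platform", "ML Engineer"])
matches = titles.str.findall("Data")

# .str.len() on a column of lists gives the number of matches per row.
match_count = matches.str.len()
print(match_count)
```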