#  <font color=green>Data Wrangling</font>   

### Table of Contents
- 1. [Reading Data from Different Types of Files](#section1)
- 2. [Select Series from Dataframe](#section2)
- 3. [Common Methods](#section3)
- 4. [Profiling](#section4)
- 5. [Renaming Column Names](#section5)
- 6. [Removing columns or rows from DataFrame](#section6)
- 7. [Sort Dataframe and Series](#section7)
- 8. [Filter Records from Dataframe](#section8)
- 9. [Iterating Series or Dataframe](#section9)
- 10. [String Methods](#section10)

<a id=section1></a>
## <font color=blue>1. Reading Data from Different Types of Files</font>

In [4]:
# Importing pandas
import pandas as pd

In [None]:
# Show up to 100 columns when displaying a DataFrame (the notebook adds a horizontal scrollbar)
pd.set_option('display.max_columns', 100)

# DataFrames read with pd.read_csv or pd.read_table will now display up to 100 columns

In [42]:
# Reading the pipe-separated USER file (no header row, so column names are passed in)
usersUrl='http://bit.ly/movieusers'
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
user=pd.read_table(usersUrl,sep='|',header=None,names=user_cols)
user.head(2)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043


In [132]:
# Reading a subset of USER csv using usecols
usersUrl='http://bit.ly/movieusers'
user_cols_subset = ['user_id', 'gender', 'occupation']
user_subset=pd.read_table(usersUrl,sep='|',header=None,names=user_cols, usecols=user_cols_subset)
user_subset.head(2)

Unnamed: 0,user_id,gender,occupation
0,1,M,technician
1,2,F,other


In [133]:
# Reading a subset of columns by zero-based position
col_position=[1,2,4]
user_subset1=pd.read_csv(usersUrl,sep='|',header=None,names=user_cols, usecols=col_position)
user_subset1.head(2)

Unnamed: 0,age,gender,zip_code
0,24,M,85711
1,53,F,94043
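
The two `usecols` variants above can be tried without a network connection. The sketch below is a minimal illustration, assuming a hypothetical three-row sample in the same pipe-separated layout as the users file. The sample rows and the names `raw`, `by_name` and `by_position` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-separated layout as the users file
raw = (
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

# usecols accepts column names (matched against the list given in names) ...
by_name = pd.read_csv(io.StringIO(raw), sep='|', header=None,
                      names=user_cols,
                      usecols=['user_id', 'gender', 'occupation'])

# ... or zero-based column positions
by_position = pd.read_csv(io.StringIO(raw), sep='|', header=None,
                          names=user_cols, usecols=[1, 2, 4])

print(by_name.columns.tolist())
print(by_position.columns.tolist())
```

Positions 1, 2 and 4 correspond to `age`, `gender` and `zip_code`, which matches the output shown above.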


In [8]:
# Reading the comma-separated UFO file
ufoUrl='http://bit.ly/uforeports'
ufo=pd.read_table(ufoUrl,sep=',')
ufo.head(2)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00


In [9]:
# Reading MOVIES csv file
moviesUrl='http://bit.ly/imdbratings'
movies=pd.read_csv(moviesUrl)
movies.head(2)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"


In [10]:
# Reading the tab-separated ORDERS file (tab is the default separator of read_table)
ordersUrl='http://bit.ly/chiporders'
orders=pd.read_table(ordersUrl)
orders.head(2)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39


In [11]:
# Reading DRINKS csv file
drinksUrl='http://bit.ly/drinksbycountry'
drinks=pd.read_csv(drinksUrl)
drinks.head(2)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe


In [12]:
# Reading TITANIC csv file
titanicUrl='http://bit.ly/kaggletrain'
titanic=pd.read_csv(titanicUrl)
titanic.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


<a id=section2></a> 
## <font color=blue>2. Select Series from Dataframe</font>

- Each column in a Dataframe is a Series.
- Method #1 (dot notation) cannot be used when the column name contains a space or matches an existing DataFrame attribute or method name.
- Column names are case sensitive when selecting a Series from a Dataframe.
- Series of the same type can be combined or concatenated, for example object with object.
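
The points above can be checked on a small frame. This is a minimal sketch, assuming a hypothetical DataFrame `df` whose `zip code` column deliberately contains a space. None of these names come from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical frame; 'zip code' deliberately contains a space
df = pd.DataFrame({
    'age': [24, 53],
    'gender': ['M', 'F'],
    'zip code': ['85711', '94043'],
})

# Bracket notation works for any column name
zips = df['zip code']

# Dot notation works only when the name is a valid Python identifier
ages = df.age

# Column names are case sensitive: 'Age' is not 'age'
has_upper = 'Age' in df.columns

# Joining an integer column to a string column needs an explicit cast
df['age-gender'] = df['age'].astype(str) + '-' + df['gender']
print(df)
```

Casting with `astype(str)` is one way to bring both Series to the same type before concatenating them.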

In [15]:
# Method #1
user.age

# Method #2
user['age']

In [22]:
# Create a new column from dataframe using series
user['Gender-Occupation']= user.gender + '-' + user.occupation

<a id=section3></a>
## <font color=blue>3. Common Methods</font>

In [25]:
# show the shape of dataframe
user.shape

In [None]:
# show descriptive statistics for the numerical columns
user.describe()
# show statistics for all columns
user.describe(include='all')

In [None]:
# show the data type of each column
user.dtypes

In [28]:
# show the column names of a dataframe
user.columns

In [None]:
# show the top and bottom rows of the dataset
user.head(5)
user.tail(5)

In [None]:
# show the count of each item in a column
user['age'].value_counts()

In [None]:
# show the data type and the count of non-null values for each column
user.info()

In [None]:
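# show the correlation of numerical columns
# numeric_only=True (available from pandas 1.5) skips text columns, which pandas 2.x no longer drops silently
user.corr(numeric_only=True)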

In [None]:
# display random sample of 10 rows from dataset
user.sample(10)

In [None]:
# show the number of null values in each column
user.isnull().sum()

<a id=section4></a>
## <font color=blue>4. Profiling</font>

In [None]:
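# pre-profiling: report on the data before processing
# (the package must be installed; newer releases are published as ydata-profiling)
import pandas_profiling
preprofile = pandas_profiling.ProfileReport(user)
# the file name is passed positionally because the keyword name differs between package versions
preprofile.to_file("user_before_processing.html")

In [None]:
# post-profiling: report on the data after processing
postprofile = pandas_profiling.ProfileReport(user)
postprofile.to_file("user_after_processing.html")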

<a id=section5></a>
## <font color=blue>5. Renaming Column Names</font>

- By default `rename` returns a new Dataframe; inplace=True applies the changes to the existing Dataframe instead
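
The effect of `inplace` is easiest to see side by side. The sketch below assumes a hypothetical two-column frame `df`; the names `renamed` and `result` are illustrative only.

```python
import pandas as pd

# Hypothetical frame used only to show the effect of inplace
df = pd.DataFrame({'age': [24, 53], 'gender': ['M', 'F']})

# Without inplace, rename returns a new DataFrame and leaves df untouched
renamed = df.rename(columns={'age': 'Age'})
print(df.columns.tolist())
print(renamed.columns.tolist())

# With inplace=True, df itself is modified and the call returns None
result = df.rename(columns={'gender': 'Gender'}, inplace=True)
print(result)
print(df.columns.tolist())
```

Assigning the returned frame back (`df = df.rename(...)`) is an equivalent alternative to `inplace=True`.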

In [32]:
# Rename one or two column names
user.rename(columns={'age':'Age','gender':'Gender'}, inplace=True)

In [39]:
# Rename multiple or all the columns
user_cols=['User_Id','Age','Gender','Occupation','Zip_Code']
user.columns=user_cols

In [70]:
# Set the column names while reading the file
user_col=['User Id','Age','Gender','Occupation','Zip Code']
userUrl='http://bit.ly/movieusers'
user=pd.read_csv(userUrl, names=user_col,sep='|')

In [63]:
# Replacing spaces with underscore in columns
user.columns=user.columns.str.replace(' ','_')

<a id=section6></a>
## <font color=blue>6. Removing columns or rows from DataFrame</font>

- axis=0            removes rows (matched by index label) from the data frame
- axis=1            removes columns from the data frame
- inplace=True      modifies the data frame itself instead of returning a new one
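
As a minimal sketch of the `axis` argument, the example below uses a hypothetical three-row frame `df` and also shows the `columns=` and `index=` keywords, which `drop` accepts as an alternative to `axis`. All variable names here are made up for illustration.

```python
import pandas as pd

# Hypothetical frame used only to illustrate drop
df = pd.DataFrame(
    {'user_id': [1, 2, 3], 'age': [24, 53, 23], 'gender': ['M', 'F', 'M']}
)

# axis=1 drops a column; columns= is the keyword form of the same call
no_age = df.drop('age', axis=1)
no_age_kw = df.drop(columns='age')

# axis=0 drops rows by index label; index= is the keyword form
no_first_two = df.drop([0, 1], axis=0)
no_first_two_kw = df.drop(index=[0, 1])

print(no_age.columns.tolist())
print(no_first_two.index.tolist())
```

Because `inplace` is not set, `df` keeps all three rows and columns after these calls.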

In [71]:
# Remove one column
user.drop('Age',inplace=True,axis=1)

In [54]:
# Remove multiple columns
col_drop_list=['User_Id','Gender','Occupation']
user.drop(col_drop_list,inplace=True,axis=1)
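# OR pass the list directly (run only one of the two calls; dropping a column that is already gone raises a KeyError)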
user.drop(['Gender','Occupation'],axis=1,inplace=True)

In [59]:
# remove rows 0 and 1
rows=[0,1]
user.drop(rows,axis=0,inplace=True)

<a id=section7></a>
## <font color=blue>7. Sort Dataframe and Series</font>

- String values are sorted by character code, so digits come before uppercase letters, which come before lowercase letters; numeric columns are sorted by value
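
The ordering rule for strings can be seen on a small Series. This sketch assumes a hypothetical list of titles; the `key` argument used for the case-insensitive variant requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical titles mixing digits, uppercase and lowercase initials
titles = pd.Series(['apple', 'Zebra', '12 Angry Men', 'Banana'])

# Default order follows character codes: digits, then uppercase, then lowercase
default_order = titles.sort_values().tolist()
print(default_order)  # ['12 Angry Men', 'Banana', 'Zebra', 'apple']

# A key function that lowercases the values gives a case-insensitive order
case_insensitive = titles.sort_values(key=lambda s: s.str.lower()).tolist()
print(case_insensitive)  # ['12 Angry Men', 'apple', 'Banana', 'Zebra']
```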

In [77]:
# Sort values of a column
movies.title.sort_values()
        # OR
movies['title'].sort_values()

In [79]:
# sort dataframe by one column
movies.sort_values('title')

In [83]:
# Sort dataframe by descending rating
movies.sort_values('star_rating', ascending=False)
    # OR
movies.sort_values(by='star_rating', ascending=False)

# Sort Dataframe with ascending rating
movies.sort_values('star_rating',ascending=True)

In [86]:
# Sort dataframe with 2 columns
movies.sort_values(['star_rating','duration'],ascending=True)
    # OR
movies.sort_values(by=['star_rating','duration'],ascending=True)

In [88]:
# Sort ascending by one column and descending by the other column
movies.sort_values(['star_rating','duration'],ascending=[True,False])
    # OR
movies.sort_values(by=['star_rating','duration'],ascending=[True,False])

<a id=section8></a>
## <font color=blue>8. Filter Records from Dataframe</font>

- **.loc** selects rows and columns by label (index labels and column names) or by a boolean mask
- **.iloc** selects rows and columns by integer position
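
The difference between the two indexers shows up most clearly when the index labels are not 0, 1, 2, and so on. The sketch below assumes a hypothetical frame `df` with index labels 10 to 40; all names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df = pd.DataFrame(
    {'title': ['A', 'B', 'C', 'D'], 'duration': [142, 175, 200, 96]},
    index=[10, 20, 30, 40],
)

# .loc slices by label and includes the end label: rows 10, 20 and 30
by_label = df.loc[10:30, 'title']

# .iloc slices by position and excludes the end position: rows 0 and 1
by_position = df.iloc[0:2, 0]

# .loc also accepts a boolean mask together with a list of column names
long_movies = df.loc[df['duration'] >= 175, ['title', 'duration']]

print(by_label.tolist())
print(by_position.tolist())
print(long_movies)
```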

In [92]:
# Filter records where duration >=200
movies[movies.duration >=200]
    # OR
movies[movies['duration']>=200]

# Filter column where duration >=200
movies[movies.duration >=200]['star_rating']
    # OR
movies[movies.duration >=200].star_rating

In [95]:
# Best Way to filter records
movies.loc[movies.duration>=200,'star_rating']
    # OR
movies.loc[movies.duration>=200]

In [110]:
# Filter records by integer position

# return rows at positions 1-4 and the column at position 1 (the second column)
movies.iloc[1:5,1:2]

# return rows at positions 1-10 and all columns
movies.iloc[1:11,:]

# return rows at positions 0-7 and the first 3 columns
movies.iloc[:8,:3]

# return the first 5 rows and the last column
movies.iloc[:5,-1:]

# return the first 5 rows and the last 2 columns
movies.iloc[:5,-2:]

In [117]:
# Filter records based on column names

# return duration column
movies.loc[:,'duration']

# return all columns up to and including duration
movies.loc[:,:'duration']

# return the columns from title to genre (both included)
movies.loc[:,'title':'genre']

In [121]:
# filter dataframe with multiple conditions (wrap each condition in parentheses)

# AND CONDITION
movies[(movies.duration>=200) & (movies.genre=='Crime')]
# OR CONDITION
movies[(movies.duration>=200) | (movies.genre=='Crime')]

# filter dataframe for 2 genres, i.e. Crime and Adventure; isin() replaces a chain of == comparisons joined with |
movies[movies.genre.isin(['Crime','Adventure'])]
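
To check that `isin` selects the same rows as the chained comparison, the sketch below uses a hypothetical four-row frame named `movies_demo`. The data is made up and stands in for the movies dataset.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset
movies_demo = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Crime', 'Adventure', 'Drama', 'Crime'],
    'duration': [142, 210, 200, 96],
})

# Two equality tests joined with | ...
chained = movies_demo[(movies_demo.genre == 'Crime') |
                      (movies_demo.genre == 'Adventure')]

# ... select the same rows as a single isin call
with_isin = movies_demo[movies_demo.genre.isin(['Crime', 'Adventure'])]
same_rows = chained.equals(with_isin)

# ~ negates a mask: every genre except Crime and Adventure
others = movies_demo[~movies_demo.genre.isin(['Crime', 'Adventure'])]

print(same_rows)
print(others.title.tolist())
```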

<a id=section9></a>
## <font color=blue> 9. Iterating Series or Dataframe</font>

In [None]:
# iterate through a column
for genre in movies.genre:
    print(genre)

In [None]:
# iterate through all rows and get the index and the row as a Series
for index,row in movies.iterrows():
    print(index,row.title,row.duration)

In [154]:
# iterate through the first 5 rows and get index and row
for index,row in movies.head().iterrows():
    print(index,row)
    # OR
for index,row in movies.iterrows():
    print(index,row['duration'],row['star_rating'])
    # OR access the values by position with .iloc
for index,row in movies.iterrows():
    print(index,row.iloc[0],row.iloc[1],row.iloc[2])

In [None]:
# iterate through rows using itertuples function
for row in movies.itertuples():
    print(row.Index,row.duration,row.star_rating)

In [145]:
# iterate over columns and get each column name with its values as a Series using items() (iteritems() was removed in pandas 2.0)
for key,value in movies.items():
    print(key, value)

In [151]:
# iterate over the columns and print the value in the third row (index label 2) of each
col=list(movies)
for i in col:
    print(movies[i][2])

<a id=section10></a>
## <font color=blue>10. String Methods</font>

In [158]:
# keep the rows where actors_list contains a given substring
movies[movies.actors_list.str.contains('Marlon Brando')]

In [160]:
# replace every comma in actors_list with a pipe
movies.actors_list.str.replace(',','|')