This notebook is a one-stop reference for basic Python syntax. It is a living document that will be updated throughout the Python Foundations course.

Its purpose is to give you a single point of syntax reference (essentially for copy-pasting) whenever you get stuck writing code yourself.

As an exercise, read through each code block and just try to understand what is happening with each piece of code. Do not try to memorize the code. Just try to become familiar with the syntax.

## Table Of Contents:

* [Data Types](#bullet-1)
    * [Numeric](#bullet-1.1)
    * [String](#bullet-1.2)
    * [Boolean](#bullet-1.3)
    * [Data Type Conversions](#bullet-1.4)
* [Data Collection Types (includes indexing, slicing and looping)](#bullet-2)
    * [Tuples and Lists](#bullet-2.1)
    * [Sets](#bullet-2.2)
    * [Dictionaries](#bullet-2.3)
* [Functions](#bullet-3)
* [Pandas Dataframe - to be updated](#bullet-4)
    * [Show Basic Table Info](#bullet-4-1)
    * [Indexing and Slicing](#bullet-4-2)
    * [Filtering](#bullet-4-3)
    * [Editing](#bullet-4-4)
    * [Merging](#bullet-4-5)

# Data Types <a class="anchor" id="bullet-1"></a>

Data belongs to different categories: it can be numeric, text, true/false, complex numbers, dates, times and so on. Data must be stored as the correct type for us to perform the desired operations on it. If it is stored incorrectly, for example numerical data stored as text, we cannot do arithmetic on it, because the methods and operations available for a piece of data depend largely on the type of the object that holds it.
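To make this concrete, here is a minimal sketch of the same digits stored as text and as a number. The variable names are made up for this illustration.

```python
# The same digits stored as text and as a number behave differently.
price_as_text = "20"      # stored as text (str)
price_as_number = 20      # stored as a whole number (int)

# With text, + joins the characters together instead of adding them.
text_result = price_as_text + price_as_text

# With numbers, + performs arithmetic.
number_result = price_as_number + price_as_number

print(text_result)    # 2020
print(number_result)  # 40
```

This is why checking the type of your data early can save a lot of confusion later.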

There are various data types in Python. Listed below are the most prominently used ones:
1. Numeric - This is further divided into 
    (a.) Integer
    (b.) Float
    
2. String

3. Boolean 

### Numeric Data Type <a class="anchor" id="bullet-1.1"></a>

As the name suggests, Numeric data type consists of numbers. These numbers can be whole numbers, decimal numbers or complex numbers.

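Whole numbers are stored in the **Integer** (int) type.

Decimal numbers are stored in the **Float** (float) type. A float is a 64-bit number, which is why pandas reports it as float64 later in this notebook.

You can check the data type of an object using the type() function.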

In [172]:
#Let's look at some examples.

Integer = 7
type(Integer)

int

In [173]:
3*3

9

In [174]:
mean_heights_sydney = 54
mean_heights_adelaide = 23

(mean_heights_sydney + mean_heights_adelaide)/2

38.5

In [175]:
def fi(x):
    '''this is a function that takes in a value of x (an int) and returns a string'''
    
    return str(x)

In [176]:
Float = 2.356
type(Float)

float

In [177]:
Complex = 3.908J
type(Complex)

complex

In [178]:
#first run the def fi(x)... code block. This way, the function fi is stored in your kernel session
#so when you say fi (like below), it knows you are referring to the function defined above
fi(2)

#notice the output is in quotes. Do you recognize this data type? If not, see below

'2'

### String Data Type <a class="anchor" id="bullet-1.2"></a>

The String data type is usually used to store text. The data to be stored is enclosed in single ('') or double ("") quotes.
Recall that you printed your name in the previous notebook. That was a string.
Let's look at an example.

In [179]:
# Printing your name

My_Name = "My name is Jupyter!"
My_Name

'My name is Jupyter!'

In [180]:
# works with single quotes as well. No difference from above.

My_Name = 'My name is Jupyter!'
My_Name

'My name is Jupyter!'

In [181]:
type(My_Name)

str

Strings support many operations, such as searching within a string, converting to lowercase/uppercase, counting occurrences, measuring length, splitting, replacing, trimming and partitioning.
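As a rough sketch of a few of these operations, the block below uses only built-in string methods. The sample sentence is invented for the illustration.

```python
# A sample string used only for this illustration.
sentence = "  Python is fun, and Python is readable.  "

trimmed = sentence.strip()                    # trimming: drop leading/trailing spaces
position = trimmed.find("fun")                # searching: index of the first match, -1 if absent
occurrences = trimmed.count("Python")         # counting: number of matches
words = trimmed.split()                       # splitting: list of words
replaced = trimmed.replace("fun", "useful")   # replacing: returns a new string
before, separator, after = trimmed.partition(", ")  # partitioning at the first ", "

print(position, occurrences, len(words))
print(replaced)
print(before, "|", after)
```

Note that none of these methods change the original string. Each one returns a new value, which is why the result is stored in a new variable.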

In [182]:
# Let's check how many characters My_Name contains.
len(My_Name)

19

In [183]:
# Let's see whether My_Name is all caps or not.
# isupper() returns True if all the letters are capitals, and False if at least one letter is in lower case.

My_Name.isupper()

False

In [184]:
# Let's convert My_Name to all caps.

My_Name = My_Name.upper()
My_Name

'MY NAME IS JUPYTER!'

In [185]:
# Now the output of isupper() changes to True.
My_Name.isupper()

True

In [186]:
My_Name[0]

'M'

In [187]:
My_Name[1]

'Y'

In [188]:
My_Name[0:2]

'MY'

In [189]:
My_Name[0:4]

'MY N'
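The indexing and slicing cells above follow one rule: counting starts at zero and the end position of a slice is not included. The sketch below restates that rule and adds negative positions, which count from the end. The variable names are chosen just for this example.

```python
text = "MY NAME IS JUPYTER!"

first_two = text[0:2]      # positions 0 and 1; position 2 is excluded
same_first_two = text[:2]  # leaving out the start means "from the beginning"
last_char = text[-1]       # negative positions count from the end
last_eight = text[-8:]     # leaving out the end means "to the end"

print(first_two, same_first_two, last_char, last_eight)
```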

### Boolean Data Type <a class="anchor" id="bullet-1.3"></a>

The boolean data type has just two values, i.e., True or False.

In [190]:
# Let's look at examples.

x = True
y = False

In [191]:
type(x)

bool

In [192]:
type(y)

bool

In [193]:
if x == True:
    print("Ok, makes sense")
else:
    print("this is weird")

Ok, makes sense


In [194]:
#for bool variables, True == 1 and False == 0
if x == 1:
    print("True is also 1")

True is also 1
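Because True counts as 1 and False counts as 0, you can add booleans up to count how many are True. A small sketch, with a made-up list of exam results:

```python
# Hypothetical data: did each student pass?
passed_exam = [True, False, True, True]

number_passed = sum(passed_exam)                  # each True adds 1
share_passed = number_passed / len(passed_exam)   # fraction of True values

print(number_passed, share_passed)
```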


### Data Type Conversions <a class="anchor" id="bullet-1.4"></a>

As explained earlier, data should be stored in the correct form so that it can be manipulated efficiently later.
Quite often, data is not stored correctly or gets imported incorrectly.
In such cases, the data needs to be converted to its correct type before it can be used properly in our analysis.
Let's work through a few data type conversions to get you comfortable with the process.

In [195]:
# Consider an object containing an integer.
string1 = 1
# I want to use the value 1 as a text value. So let's convert it.
string1 = str(string1)
type(string1)

str

In [196]:
# Let's do the opposite now and convert a string to a float.
x = '3'
x = float(x)
type(x)

float

In [197]:
int1 = "500"
type(int1)

str

In [198]:
int1 = int(int1)
type(int1)

int
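The conversions above assume the text really does look like a number. When it does not, int() raises a ValueError. Below is a minimal sketch of one way to handle that. The helper name to_int_or_none is hypothetical and not part of Python.

```python
def to_int_or_none(text):
    """Hypothetical helper: return int(text), or None if text is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

good = to_int_or_none("500")     # converts cleanly
bad = to_int_or_none("3.7")      # int() rejects text containing a decimal point
via_float = int(float("3.7"))    # going through float works, but drops the decimals

print(good, bad, via_float)
```

Going through float first is a judgment call: it silently truncates 3.7 to 3, which may or may not be what your analysis needs.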

# Data Collection Types (with looping, indexing and slicing) <a class="anchor" id="bullet-2"></a>

### Tuples and Lists <a class="anchor" id="bullet-2.1"></a>

Tuples are defined with (). Lists are defined with [].

Both accept any data type (e.g. float, string, int) and can mix several types in one collection.
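One practical difference between the two is that a list can be changed after it is created, while a tuple cannot. A short sketch with made-up values:

```python
colors_list = ["red", "green"]
colors_tuple = ("red", "green")

colors_list[0] = "blue"          # allowed: lists are mutable

try:
    colors_tuple[0] = "blue"     # not allowed: tuples are immutable
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(colors_list, colors_tuple, tuple_changed)
```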

In [199]:
random_tuple = ("da", 32,5.3, [3,4]) #notice here I'm putting a list inside a tuple!
random_tuple

('da', 32, 5.3, [3, 4])

In [200]:
tuple_months = ('January', 'February', 'March', 'April', 'May', 'June',
                'July', 'August', 'September', 'October', 'November', 'December')

In [201]:
tuple_months

('January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December')

In [202]:
tuple_months[3]

'April'

In [203]:
list_cats = ['Tom', 'Snappy', 'Kitty', 'Jessie', 'Chester', ['Beck', "Dairy"]]

In [204]:
list_cats

['Tom', 'Snappy', 'Kitty', 'Jessie', 'Chester', ['Beck', 'Dairy']]

In [205]:
print(list_cats[2])

Kitty


In [206]:
print(list_cats[0])

Tom


In [207]:
print(list_cats[0:3])

['Tom', 'Snappy', 'Kitty']


In [208]:
list_cats.append('Catherine')

In [209]:
list_cats

['Tom', 'Snappy', 'Kitty', 'Jessie', 'Chester', ['Beck', 'Dairy'], 'Catherine']

In [210]:
del list_cats[1]

In [211]:
list_cats

['Tom', 'Kitty', 'Jessie', 'Chester', ['Beck', 'Dairy'], 'Catherine']

In [212]:
index_locator = 0
for cats in list_cats:
    if cats == "Kitty":
        print("Found the index! It is", index_locator)
        # break ends the for loop immediately, even if the end of the list has not been reached
        break
    else:
        index_locator = index_locator + 1 #add one to the variable

Found the index! It is 1
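The loop above keeps its own counter. As an alternative sketch, Python's built-in enumerate hands you the position and the item together, and the list method index does the whole search in one call. The list here is a shortened copy made for the example.

```python
cats = ["Tom", "Kitty", "Jessie"]

# enumerate yields (position, item) pairs, so no manual counter is needed.
found_at = None
for position, name in enumerate(cats):
    if name == "Kitty":
        found_at = position
        break

# list.index returns the position of the first match.
# It raises a ValueError if the item is not in the list.
same_answer = cats.index("Kitty")

print(found_at, same_answer)
```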


In [213]:
list_cats[index_locator]

'Kitty'

In [214]:
list_cats[1]

'Kitty'

In [215]:
list_cats[4]
#what is the data collection type of this?

['Beck', 'Dairy']

In [216]:
type(list_cats[4])

list

In [217]:
list_cats[4][0]

'Beck'

### Sets <a class="anchor" id="bullet-2.2"></a>

Sets are written between curly brackets {}. They should not be confused with dictionaries, which also use curly brackets but have a : symbol between each key and its value.

In [218]:
my_set = {1, 2, 3}

In [219]:
your_set = {4, 2, 5}

In [220]:
my_set | your_set

{1, 2, 3, 4, 5}

In [221]:
my_set & your_set

{2}

In [222]:
my_set - your_set

{1, 3}
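A set keeps each value only once, which makes it handy for removing duplicates. The sketch below shows that, plus a membership check and the ^ operator (values in exactly one of the two sets). The city list is invented for the example.

```python
# Hypothetical data with repeated values.
visits = ["Sydney", "Adelaide", "Sydney", "Perth", "Adelaide"]

unique_cities = set(visits)             # duplicates are dropped
number_of_cities = len(unique_cities)
visited_perth = "Perth" in unique_cities

# Values that appear in one set or the other, but not in both.
in_exactly_one = {1, 2, 3} ^ {4, 2, 5}

print(number_of_cities, visited_perth, in_exactly_one)
```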

### Dictionaries <a class="anchor" id="bullet-2.3"></a>

Dictionaries are represented as {key:value}

In [223]:
CO2_by_year = {1799:1, 1800:70, 1801:74, 1802:82, 1902:215630, 2002:1733297}

In [226]:
# Look up the emissions for the given year
CO2_by_year[1801]

74

In [227]:
# Add another year to the dictionary
CO2_by_year[1950] = 734914

In [228]:
CO2_by_year

{1799: 1,
 1800: 70,
 1801: 74,
 1802: 82,
 1902: 215630,
 2002: 1733297,
 1950: 734914}

In [229]:
CO2_by_year[2009] = 1000000
CO2_by_year[2000] = 100000

In [230]:
1950 in CO2_by_year

True

In [231]:
len(CO2_by_year)

9

In [232]:
del CO2_by_year[1950]

In [233]:
len(CO2_by_year)

8

In [234]:
for key in CO2_by_year:
    print(key)

1799
1800
1801
1802
1902
2002
2009
2000


In [235]:
for k in CO2_by_year.keys():
    print(k)

1799
1800
1801
1802
1902
2002
2009
2000


In [236]:
for v in CO2_by_year.values():
    print(v)

1
70
74
82
215630
1733297
1000000
100000


In [237]:
CO2_by_year.values()

dict_values([1, 70, 74, 82, 215630, 1733297, 1000000, 100000])

In [238]:
for key, value in CO2_by_year.items():
    print(key, value)

1799 1
1800 70
1801 74
1802 82
1902 215630
2002 1733297
2009 1000000
2000 100000
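Looking up a key that is not in the dictionary with square brackets raises a KeyError. The get method is a gentler alternative that returns None, or a default you choose. A minimal sketch, using a shortened copy of the dictionary:

```python
emissions = {1799: 1, 1800: 70, 1801: 74}

known = emissions.get(1801)            # key exists, so its value is returned
missing = emissions.get(1850)          # key is absent, so None is returned
missing_default = emissions.get(1850, 0)   # key is absent, so the default 0 is returned

try:
    emissions[1850]                    # square brackets raise an error instead
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(known, missing, missing_default, raised_key_error)
```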


# Functions <a class="anchor" id="bullet-3"></a>

In [239]:
#You can name the function anything. In this case, we name it convert_to_celsius
def convert_to_celsius(fahrenheit):
    ''' (number) -> number
    Return the celsius degrees equivalent to
    fahrenheit degrees.
    '''
    celsius = (fahrenheit - 32) * 5 / 9
    return celsius #this is the returned output by the function

In [240]:
convert_to_celsius(32)

0.0

In [241]:
#If you pass in a string, the function will not work, because you cannot subtract a number from text.
# Uncomment the code below by removing the hash # sign, then run it.
# The next time you see a TypeError message, you'll know what it means.

# convert_to_celsius('3')

In [242]:
convert_to_celsius(212)

100.0

In [243]:
convert_to_celsius(-40)

-40.0

In [244]:
#You can store the value returned back by the function into a variable

returned_value_variable = convert_to_celsius(-40)
returned_value_variable

-40.0
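If you would rather see the TypeError mentioned above without stopping your notebook, you can catch it with try/except. This is only a sketch: the function is repeated here so the block runs on its own, and the variable names outcome and fixed are made up.

```python
def convert_to_celsius(fahrenheit):
    """Same formula as the function defined earlier in this section."""
    return (fahrenheit - 32) * 5 / 9

try:
    convert_to_celsius("3")        # text minus a number is not defined
    outcome = "worked"
except TypeError as error:
    outcome = "TypeError"
    print("TypeError:", error)

# Converting the text to a number first avoids the error.
fixed = convert_to_celsius(float("212"))
print(outcome, fixed)
```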

# Pandas Dataframe <a class="anchor" id="bullet-4"></a>

In [245]:
#make sure you import the pandas library in your notebook
import pandas as pd 
import seaborn as sns

In [246]:
#you can give your dataframe object (i.e. table) any name. In this case, we use data.
data = sns.load_dataset("iris")

In [247]:
#the pandas dataframe is one of the most powerful tabular data processing tools out there for data science
type(data)

pandas.core.frame.DataFrame

## Basic Table Info <a class="anchor" id="bullet-4-1"></a>

In [248]:
#check out the pandas head documentation here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
#notice how, in the parameters, it expects an int type variable. This is why knowing the data types is important.
# if you don't give it an int, the .head method will throw an error.


#defaults to first five rows. But you can give another int variable inside the brackets
#you could also just say data.head(2), but I've followed the documentation exactly for clarity.
data.head(n=2)

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa


In [249]:
# Looking at the last few rows of the data frame
#can you find the pandas official documentation for the tail method? What parameters/arguments does it expect? 
data.tail(5)

     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica


In [250]:
#number of rows,columns
#don't know the syntax for a tabular operation? Google the question along with "pandas dataframe", and you'll likely find the syntax on Stack Overflow
# for example here: https://stackoverflow.com/questions/13921647/python-dimension-of-data-frame

data.shape

(150, 5)

In [251]:
#which column is missing from here?
# species is missing because it is not a numeric column/series; describe() summarises only numeric columns by default
data.describe()

       sepal_length  sepal_width  petal_length  petal_width
count         150.0        150.0         150.0        150.0
mean       5.843333     3.057333         3.758     1.199333
std        0.828066     0.435866      1.765298     0.762238
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5


In [252]:
#Each column has a data type. 
# "object" = string. You've seen the float type before.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [253]:
#print column names. Notice the data type is an Index.
data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [254]:
#Index data type can be indexed (selected) just like lists !
data.columns[1]

'sepal_width'

In [255]:
#slicing (i.e. selecting multiple values). Notice that it returns back an index. 
data.columns[0:2]


Index(['sepal_length', 'sepal_width'], dtype='object')

In [256]:
#We know this returns Index data collection type. But we can change this back to list. This should look familiar.
data.columns[0:2].tolist()

['sepal_length', 'sepal_width']

In [257]:
#a dataframe is comprised of columns that are called Series. 
# Series is another data collection type where we can store data (like lists, tuples etc.)
type(data["sepal_length"])

pandas.core.series.Series

In [258]:
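#numeric_only = True skips the text column species; recent pandas versions raise an error without it
data.mean(axis = 0, numeric_only = True)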

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

In [259]:
#the series object comes with many built-in methods.
# each method has its own documentation page, for example nunique: https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html
data["sepal_length"].mean()


5.843333333333335

In [260]:
#another series method displaying the number of unique values in column species, in this case 3.
data["species"].nunique()

3

In [261]:
#Displays all the unique values of the column species along with its count
data["species"].value_counts()

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

In [262]:
#you can access a column with dot notation as well. But my advice is to always use the data["col_name"] notation, because it extends to the general case of selecting more than one column: data[["col_name_1", "col_name_2", ...]]
data.sepal_length.mean()

5.843333333333335

In [263]:
#get the average for each species. Experiment with other .agg arguments like median, count
#this dataset was compiled to see if given certain flower measurements, we can guess the species of the iris flower.

#What do you think by looking at this table?

data.groupby("species").agg(["mean"])

            sepal_length  sepal_width  petal_length  petal_width
                    mean         mean          mean         mean
species
setosa             5.006        3.428         1.462        0.246
versicolor         5.936         2.77          4.26        1.326
virginica          6.588        2.974         5.552        2.026
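To try other .agg arguments, you can pass several function names at once. The sketch below uses a tiny hand-made table rather than the iris data, so the numbers are invented. The names flowers and summary are just for this example.

```python
import pandas as pd

# A small made-up table standing in for the iris data.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# One row per species, one column per aggregation.
summary = flowers.groupby("species")["petal_length"].agg(["mean", "median", "count"])
print(summary)
```

Selecting a single column before calling .agg keeps the result flat, with one plain column per function.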


## Indexing and Slicing <a class="anchor" id="bullet-4-2"></a>

In [264]:
#What you learned about indexing and slicing lists applies to Series as well!

#create a variable
series_is_like_list = data["sepal_length"]

In [265]:
type(series_is_like_list)

pandas.core.series.Series

In [266]:
#indexing
series_is_like_list[0]

5.1

In [267]:
#slicing. 
series_is_like_list[0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [268]:
#Notice that it returns a Series.
type(series_is_like_list[0:2])

pandas.core.series.Series

In [269]:
#We can change this to list. This should look familiar again.

series_is_like_list[0:2].to_list()

[5.1, 4.9]

In [270]:
#change type to Numpy array
series_is_like_list[0:2].to_numpy()

array([5.1, 4.9])

In [271]:
data[["sepal_length", "sepal_width"]].head(2)

   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0


In [272]:
#use double [[]] bracket to return dataframe.
type(data[["sepal_length"]])

pandas.core.frame.DataFrame

In [273]:
data[["sepal_length"]].head(2)

   sepal_length
0           5.1
1           4.9


In [274]:
data.head(2)

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa


In [275]:
# iloc/loc notation can be used to select and change values in dataframe.

#syntax for indexing (selecting one value) /  slicing (selecting multiple values)
# data.iloc[row index/slice, column index/slice]


#first row, second column. Remember, numbering always starts with zero
data.iloc[0,1]

3.5

In [276]:
data.loc[0,"sepal_width"]

3.5

In [277]:
data.iloc[0:2, 0:3]

   sepal_length  sepal_width  petal_length
0           5.1          3.5           1.4
1           4.9          3.0           1.4


In [278]:
data.loc[0:2, "sepal_length":"petal_length"]

   sepal_length  sepal_width  petal_length
0           5.1          3.5           1.4
1           4.9          3.0           1.4
2           4.7          3.2           1.3
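Notice that the two slices above returned a different number of rows. With iloc the end of a slice is excluded, as with lists. With loc the slice is by label and the end label is included. A minimal sketch on a made-up table named small:

```python
import pandas as pd

# A tiny made-up table with the default row labels 0, 1, 2, 3.
small = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

by_position = small.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = small.loc[0:2]       # labels 0, 1 and 2; the end label is included

print(len(by_position), len(by_label))
```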


In [279]:
#selecting specific rows. Returns back a dataframe
data.iloc[[1,3,4], 0:3]

   sepal_length  sepal_width  petal_length
1           4.9          3.0           1.4
3           4.6          3.1           1.5
4           5.0          3.6           1.4


In [280]:
#selecting specific columns. Returns back a dataframe
data.loc[[1,3,4], ["sepal_length", "petal_length"]]

   sepal_length  petal_length
1           4.9           1.4
3           4.6           1.5
4           5.0           1.4


In [281]:
data.head(2)

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa


## Filtering <a class="anchor" id="bullet-4-3"></a>

In [282]:
#only the entries where sepal_length > 4.7. It returns a dataframe
data[data["sepal_length"] > 4.7]

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
5             5.4          3.9           1.7          0.4     setosa
7             5.0          3.4           1.5          0.2     setosa
...           ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica


In [283]:
data["species"].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [284]:
#filter a categorical column using a custom list
sample_list = ["virginica", "versicolor"]
data[data["species"].isin(sample_list)]

     sepal_length  sepal_width  petal_length  petal_width     species
50            7.0          3.2           4.7          1.4  versicolor
51            6.4          3.2           4.5          1.5  versicolor
52            6.9          3.1           4.9          1.5  versicolor
53            5.5          2.3           4.0          1.3  versicolor
54            6.5          2.8           4.6          1.5  versicolor
...           ...          ...           ...          ...         ...
145           6.7          3.0           5.2          2.3   virginica
146           6.3          2.5           5.0          1.9   virginica
147           6.5          3.0           5.2          2.0   virginica
148           6.2          3.4           5.4          2.3   virginica


In [285]:
data[data["species"].isin(sample_list)]["species"].unique()

array(['versicolor', 'virginica'], dtype=object)
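Filters can also be combined. In pandas, & means "and", | means "or" and ~ means "not", and each condition needs its own round brackets. The sketch below uses a small hand-made table, so the values and the variable names are invented for the example.

```python
import pandas as pd

# A small made-up table standing in for the iris data.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.6, 6.7, 5.9],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Both conditions must hold (and).
long_setosa = flowers[(flowers["sepal_length"] > 4.7) & (flowers["species"] == "setosa")]

# At least one condition must hold (or).
big_or_setosa = flowers[(flowers["sepal_length"] > 6.0) | (flowers["species"] == "setosa")]

# The condition must not hold (not).
not_setosa = flowers[~(flowers["species"] == "setosa")]

print(list(long_setosa.index), list(big_or_setosa.index), list(not_setosa.index))
```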

## Editing Values (iloc/loc) <a class="anchor" id="bullet-4-4"></a>

In [286]:
#Like lists, series are also mutable. But it's good practice to always use loc/iloc notation to change values.

#notice the warning message if we don't use loc/iloc notation. We will talk more about this in class.
data["sepal_length"][0] = 4.1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["sepal_length"][0] = 4.1


In [287]:
#let's now change the value back to the original value using iloc
data.iloc[0,0] = 5.1

In [288]:
#change done. No warning message from above.
data.head(2)

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa


In [290]:
#sorting

#inplace = True means you want to update the dataframe itself. Otherwise, sort_values returns a new sorted dataframe and leaves the original unchanged
#ascending = False means you want descending order (largest value first). Similar to ORDER BY DESC in SQL.

data.sort_values(by = ["species"] ,inplace = True, ascending = False)

In [291]:
#add mean as a column
#takes the mean petal_length of each species, adds it as a new column in original dataframe

data["mean_species_petal_length"] = data.groupby("species")["petal_length"].transform("mean")

In [292]:
#it is not efficient to use a for loop if you want to apply some transformation to a dataframe.
#here is a code snippet you can use to modify a dataframe with your own function
#here, x represents a single value within the specified dataframe column
# generic syntax: df["new_col_name"] = df["column_to_do_data_processing_on"].apply(name_of_your_function)

def multiply_10(x):
    return x*10

data["petal_length_times_10"] = data["petal_length"].apply(multiply_10)
data.head(2)

     sepal_length  sepal_width  petal_length  petal_width    species  mean_species_petal_length  petal_length_times_10
149           5.9          3.0           5.1          1.8  virginica                      5.552                   51.0
111           6.4          2.7           5.3          1.9  virginica                      5.552                   53.0


In [293]:
#drop column
data.drop(["petal_length_times_10"], axis = 1, inplace = True)
data.head()

     sepal_length  sepal_width  petal_length  petal_width    species  mean_species_petal_length
149           5.9          3.0           5.1          1.8  virginica                      5.552
111           6.4          2.7           5.3          1.9  virginica                      5.552
122           7.7          2.8           6.7          2.0  virginica                      5.552
121           5.6          2.8           4.9          2.0  virginica                      5.552
120           6.9          3.2           5.7          2.3  virginica                      5.552


In [294]:
#change values of only the virginica species
#instead of writing an explicit loop, you can use the .apply method with axis=1.
# You can think of it as doing the same thing: going over each row of the dataframe and executing your function on that row.
#Here, when you say row["col_name"], you are accessing the value in that column for the current row.

def my_function(row):
    if row['species'] == "virginica":
        return row["petal_length"] * 10
    else:
        return row["petal_length"]

data["petal_length_times_10"] = data.apply(my_function, axis=1)
data.head(2)

     sepal_length  sepal_width  petal_length  petal_width    species  mean_species_petal_length  petal_length_times_10
149           5.9          3.0           5.1          1.8  virginica                      5.552                   51.0
111           6.4          2.7           5.3          1.9  virginica                      5.552                   53.0


In [295]:
data.tail(2)

    sepal_length  sepal_width  petal_length  petal_width  species  mean_species_petal_length  petal_length_times_10
28           5.2          3.4           1.4          0.2   setosa                      1.462                    1.4
0            5.1          3.5           1.4          0.2   setosa                      1.462                    1.4
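The same "change only the virginica rows" result can also be reached with the loc notation from this section, using a True/False filter to pick the rows. This is a sketch of one alternative, not a replacement for the approach above. It uses a small hand-made table, and the names flowers and is_virginica are made up for the example.

```python
import pandas as pd

# A small made-up table standing in for the iris data.
flowers = pd.DataFrame({
    "species": ["virginica", "setosa", "virginica"],
    "petal_length": [5.1, 1.4, 5.3],
})

# Start with a plain copy of the column.
flowers["petal_length_times_10"] = flowers["petal_length"]

# A Series of True/False values, one per row.
is_virginica = flowers["species"] == "virginica"

# Overwrite only the rows where the filter is True.
flowers.loc[is_virginica, "petal_length_times_10"] = flowers.loc[is_virginica, "petal_length"] * 10

print(flowers)
```

Because pandas processes the whole column at once here, this style usually scales better to large tables than calling a function on every row.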


## Merging <a class="anchor" id="bullet-4-5"></a>

In [296]:
#merging

df1 = pd.DataFrame({"customer":['101','102','103','104'], 
                    'category': ['cat2','cat2','cat1','cat3'],
                    'important': ['yes','no','yes','yes'],
                    'sales': [123,52,214,663]},index=[0,1,2,3])

df2 = pd.DataFrame({"customer":['101','103','104','105'], 
                    'color': ['yellow','green','green','blue'],
                    'distance': [12,9,44,21],
                    'sales': [123,214,663,331]},index=[4,5,6,7])



In [297]:
df1.head(5)

   customer  category  important  sales
0       101      cat2        yes    123
1       102      cat2         no     52
2       103      cat1        yes    214
3       104      cat3        yes    663


In [298]:
df2.head(5)

   customer   color  distance  sales
4       101  yellow        12    123
5       103   green         9    214
6       104   green        44    663
7       105    blue        21    331


In [299]:
df1.shape, df2.shape

((4, 4), (4, 4))

In [300]:
# before merging, it's good standard practice to call reset_index, i.e. have the row index of each dataframe start from 0
#e.g. currently df2 starts from 4

df1.reset_index(drop = True, inplace = True)
df2.reset_index(drop = True, inplace = True)


In [301]:
df2.head()

   customer   color  distance  sales
0       101  yellow        12    123
1       103   green         9    214
2       104   green        44    663
3       105    blue        21    331


In [302]:
df_all = df1.merge(df2, on = "customer", how = "inner")
df_all

   customer  category  important  sales_x   color  distance  sales_y
0       101      cat2        yes      123  yellow        12      123
1       103      cat1        yes      214   green         9      214
2       104      cat3        yes      663   green        44      663


In [303]:
df_all.shape

(3, 7)
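The inner merge above keeps only customers found in both tables, and pandas adds _x and _y to column names that clash. As a sketch of two other options, how = "outer" keeps every customer from either table, suffixes lets you choose clearer names, and indicator = True adds a _merge column saying where each row came from. The two small tables below are made up for the example.

```python
import pandas as pd

# Two small made-up tables that share the customer column.
left = pd.DataFrame({"customer": ["101", "102"], "sales": [123, 52]})
right = pd.DataFrame({"customer": ["101", "105"], "sales": [123, 331]})

# Inner: only customers present in both tables.
inner = left.merge(right, on="customer", how="inner")

# Outer: every customer from either table, with custom suffixes and a source indicator.
outer = left.merge(
    right,
    on="customer",
    how="outer",
    suffixes=("_left", "_right"),
    indicator=True,
)

print(inner)
print(outer)
```

Rows that exist in only one table get missing values (NaN) in the columns that come from the other table.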