<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Missing Data</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Representing Missing Values</h2>
</div>

The usual way of representing a missing value in plain Python is `None`.

NumPy and Pandas, however, provide several additional ways to represent missing data.

NumPy provides `np.nan`, a floating-point value that can be stored in float and object arrays.

Pandas provides `pd.NA` as a general-purpose missing-value marker and `pd.NaT` (not a time) for missing datetime data.

In [None]:
import numpy as np
import pandas as pd

In [None]:
np.nan

nan

In [None]:
pd.NA

<NA>

In [None]:
pd.NaT

NaT

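__Be careful when using missing values in comparisons__

Equality with `np.nan` is always `False`. The `in` and `is` checks succeed only because they test object identity, and comparisons with `pd.NA` propagate the missing value and return `<NA>`.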

In [None]:
np.nan == np.nan

False

In [None]:
np.nan in [np.nan]

True

In [None]:
np.nan is np.nan

True

In [None]:
pd.NA == pd.NA

<NA>
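
Because equality comparisons are unreliable for missing values, a safer way to test for them is `pd.isna()`. The sketch below is illustrative only; `values` and `ser_demo` are made-up names that are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# A mix of missing-value markers and ordinary values (illustrative data)
values = [np.nan, None, pd.NA, pd.NaT, 0, ""]

# pd.isna() recognises every missing-value marker; 0 and "" are real values
flags = [bool(pd.isna(v)) for v in values]
print(flags)  # [True, True, True, True, False, False]

# The same check works element-wise on a Series
ser_demo = pd.Series([1.0, np.nan, 3.0])
print(ser_demo.isna().tolist())  # [False, True, False]
```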

__Import Data__

In [None]:
df = pd.read_csv('Datasets/Titanic.csv').sample(50, random_state=100)
df.to_csv('Datasets/Titanic_orig.csv', index=False)  # store for later reference

In [None]:
df.reset_index(inplace=True)

__Insert missing values randomly__, so that the filled values can later be compared with the originals

In [None]:
df_copy = df.copy()

In [None]:
n_missing = 10

for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Age"] = pd.NA

    
for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Fare"] = pd.NA
    
for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Sex"] = pd.NA

In [None]:
df.head(10)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,,0,1,PC 17558,247.521,B58 B60,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,,,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,,C99,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S


In [None]:
df.to_csv("Datasets/Titanic_missing.csv", index=False)

## Checking and Selecting for Null Values

In [None]:
df = pd.read_csv("Datasets/Titanic_missing.csv")

In [None]:
df.isnull().head(10)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,True,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,True,True,False
7,False,False,False,False,False,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,False,False,False,False,True,False,False
9,False,False,False,False,False,False,True,False,False,False,False,True,False


In [None]:
df.notnull().head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,False,True,True,True,True,False,True
2,True,True,True,True,True,True,True,True,True,True,True,False,True
3,True,True,True,True,True,True,True,True,True,True,True,False,True
4,True,True,True,True,True,True,True,True,True,True,True,False,True


In [None]:
df['Age'].head()

__Drop rows that contain missing values in `Age`__

In [None]:
# Drop rows that contain missing values in Age
df[df['Age'].notnull()].head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2,0,1,347054,10.4625,G6,S
2,821,822,1,3,"Lulic, Mr. Nikola",male,27,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39,0,0,28213,13.0,,S
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18,1,1,370129,,,S


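Keep rows that have a value in `Age` or `Fare`, that is, drop a row only when both are missing. Combining the two conditions with `|` keeps rows where at least one of the columns is present; to drop rows that are missing either `Age` or `Fare`, combine them with `&` instead.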

In [None]:
df[(df['Age'].notnull()) | df['Fare'].notnull()]

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,,0,1,PC 17558,247.521,B58 B60,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,,,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,,C99,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S
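
The difference between the two ways of combining the conditions is easier to see on a tiny frame. This is a minimal sketch on made-up data (the `toy` frame is hypothetical), and it also shows `dropna(subset=...)` as an equivalent shortcut.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame covering every present/missing combination
toy = pd.DataFrame({
    "Age":  [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, np.nan],
})

# Keep rows where BOTH columns are present (drop if Age OR Fare is missing)
both_present = toy[toy["Age"].notnull() & toy["Fare"].notnull()]

# Equivalent, using dropna with a column subset
both_present_dropna = toy.dropna(subset=["Age", "Fare"])

# Keep rows where AT LEAST ONE column is present (drop only if both are missing)
either_present = toy.dropna(subset=["Age", "Fare"], how="all")

print(both_present.index.tolist())    # [0]
print(either_present.index.tolist())  # [0, 1, 2]
```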


### Challenge

Count the total number of missing values in `df`

```python
import pandas as pd
import numpy as np

df = pd.read_csv('Datasets/Titanic.csv').sample(50, random_state=100)

n_missing = np.random.randint(4, 15)  # a single random integer in [4, 15)

for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Age"] = pd.NA

    
for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Fare"] = pd.NA
    
for i in range(n_missing):
    row = np.random.randint(1, df.shape[0])
    df.at[row, "Sex"] = pd.NA
```

Code URL: https://git.io/JswTN

In [None]:
# Solution
import pandas as pd
import numpy as np

df = pd.read_csv('Datasets/Titanic.csv').sample(50, random_state=100)

In [None]:
n_missing = np.random.randint(4, 15, (1))[0]
print(n_missing)

for i in range(n_missing):
    row = np.random.randint(1, df.shape[0], (1))[0]
    df.at[row, "Age"] = pd.NA

for i in range(n_missing):
    row = np.random.randint(1, df.shape[0], (1))[0]
    df.at[row, "Fare"] = pd.NA
    
for i in range(n_missing):
    row = np.random.randint(1, df.shape[0], (1))[0]
    df.at[row, "Sex"] = pd.NA

11


In [None]:
df.isna().sum()

PassengerId    24
Survived       24
Pclass         24
Name           24
Sex            24
Age            34
SibSp          24
Parch          24
Ticket         24
Fare           25
Cabin          62
Embarked       24
dtype: int64

In [None]:
df.isna().sum().sum()

337
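
The same counting idea can be broken down per column, per row, or as a share of each column. The sketch below uses a small made-up frame (`toy` is hypothetical) so the numbers can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 5 missing cells in total
toy = pd.DataFrame({
    "Age":  [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, np.nan],
    "Sex":  ["male", "female", None, "male"],
})

per_column = toy.isna().sum()        # missing count per column
per_row = toy.isna().sum(axis=1)     # missing count per row
total = int(toy.isna().sum().sum())  # grand total of missing cells
share = toy.isna().mean()            # fraction of missing values per column

print(per_column.to_dict())  # {'Age': 2, 'Fare': 2, 'Sex': 1}
print(per_row.tolist())      # [0, 1, 2, 2]
print(total)                 # 5
print(share.to_dict())       # {'Age': 0.5, 'Fare': 0.5, 'Sex': 0.25}
```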

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Threshold based Dropping</h2>
</div>

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic_missing.csv")

In [None]:
df.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S


`dropna()` drops all rows that contain a missing value by default.

In [None]:
df.dropna()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
10,689,690,1,1,"Madill, Miss. Georgette Alexandra",female,15.0,0,1,24160,211.3375,B5,S
16,412,413,1,1,"Minahan, Miss. Daisy E",female,33.0,1,0,19928,90.0,C78,Q
21,325,326,1,1,"Young, Miss. Marie Grice",female,36.0,0,0,PC 17760,135.6333,C32,C
49,591,592,1,1,"Stephenson, Mrs. Walter Bertram (Martha Eustis)",female,52.0,1,0,36947,78.2667,D20,C


Sometimes you want to drop a row only when it has more than a certain number of missing values. You can control this with the `thresh` parameter, which sets the minimum number of non-missing values a row must have to be retained.

In [None]:
df.dropna(thresh=df.shape[1]-1) # at most 1 missing value
# df.dropna(thresh=df.shape[1]-2) # at most 2 missing values

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,,0,1,PC 17558,247.5208,B58 B60,C
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,,C99,S
10,689,690,1,1,"Madill, Miss. Georgette Alexandra",female,15.0,0,1,24160,211.3375,B5,S
11,320,321,0,3,"Dennis, Mr. Samuel",male,22.0,0,0,A/5 21172,7.25,,S
12,405,406,0,2,"Gale, Mr. Shadrach",male,34.0,1,0,28664,21.0,,S
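
To see how `thresh` counts, here is a minimal sketch on a made-up three-column frame (`toy` is hypothetical) whose rows have 0, 1 and 2 missing values respectively.

```python
import numpy as np
import pandas as pd

# Row 0 has no missing values, row 1 has one, row 2 has two
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})
n_cols = toy.shape[1]

# thresh is the minimum number of NON-missing values required to keep a row
at_most_1 = toy.dropna(thresh=n_cols - 1)  # needs >= 2 present values
at_most_2 = toy.dropna(thresh=n_cols - 2)  # needs >= 1 present value

print(at_most_1.index.tolist())  # [0, 1]
print(at_most_2.index.tolist())  # [0, 1, 2]
```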


__Alternately, you can drop an entire column if it contains a missing value by setting axis=1 (or 'columns')__

In [None]:
df.dropna(axis='columns')

### Challenge:

From `df`, drop the columns that contain more than 10% missing values.

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic_missing.csv")

In [None]:
# Solution
df.dropna(thresh=df.shape[0]*.9, axis=1)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,SibSp,Parch,Ticket,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",0,1,347054,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",0,0,330958,Q
2,821,822,1,3,"Lulic, Mr. Nikola",0,0,315098,S
3,458,459,1,2,"Toomey, Miss. Ellen",0,0,F.C.C. 13531,S
4,795,796,0,2,"Otter, Mr. Richard",0,0,28213,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",0,1,PC 17558,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",1,1,370129,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",1,6,CA 2144,S
8,269,270,1,1,"Bissette, Miss. Amelia",0,0,PC 17760,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",3,1,4133,S
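
An alternative way to express the same rule is to compute the share of missing values per column and filter on it directly. This is only a sketch on made-up data; the `toy` frame and its column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Age is 10% missing, Cabin is 20% missing
toy = pd.DataFrame({
    "id":    range(10),
    "Age":   [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Cabin": [np.nan, np.nan, "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Keep only the columns with at most 10% missing values
kept = toy.loc[:, missing_share <= 0.10]

print(list(kept.columns))  # ['id', 'Age']
```

Filtering on `missing_share` makes the cut-off explicit, which can be easier to read than translating a percentage into a `thresh` count.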


<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Approaches to Filling Data</h2>
</div>

Depending on the context you can fill missing data with a standard, meaningful value.

Some common approaches are to fill the missing values with:
1. __Zero__ (Ex: if the 'Fare' data was missing, it could be possible that the person did not pay anything, in which case replacing with zero makes sense.)

2. __Most Frequent Observation__: for categorical data

3. __Mean / Median__ of the entire data or of an appropriate group

4. __Forward Fill / Backward Fill__ for sequential data

5. __Interpolation__ on a case-by-case basis. Ex: ordered data
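
Forward fill and backward fill from the list above can be sketched on a short daily series. The `readings` series below is made-up data used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward
forward = readings.ffill()

# Backward fill: pull the next known value backward;
# a trailing gap stays missing because nothing follows it
backward = readings.bfill()

print(forward.tolist())   # [10.0, 10.0, 10.0, 13.0, 13.0]
print(backward.tolist())  # [10.0, 13.0, 13.0, 13.0, nan]
```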

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic_missing.csv")

In [None]:
df

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,,0,1,PC 17558,247.5208,B58 B60,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,,,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,,C99,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S


__Fill with Zero__

In [None]:
# Fill all NA with 0
df.fillna(0)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,0.0,0,0,330958,7.8792,0,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,0,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,0,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,0,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,0.0,0,1,PC 17558,247.5208,B58 B60,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,0.0,0,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,0,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,0.0,C99,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",female,0.0,3,1,4133,25.4667,0,S


__Fill with the most frequent value__

It may not be appropriate to fill categorical / string data with 0's. For categorical data, the common practice is to fill it with the most frequent value (in the entire dataset or in a given group).

In [None]:
# Fill with the most frequent value
df.Sex.value_counts()

male      22
female    20
Name: Sex, dtype: int64

In [None]:
most_frequent = df.Sex.value_counts().index[0]
most_frequent

'male'

In [None]:
df['Sex'].fillna(most_frequent).head(10)

0    female
1    female
2      male
3    female
4      male
5      male
6      male
7    female
8    female
9    female
Name: Sex, dtype: object
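
Another way to get the most frequent value is `Series.mode()`. The sketch below uses a made-up series (`sex` is hypothetical); when several values tie, `mode()` returns all of them, so taking `[0]` picks the first.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
sex = pd.Series(["male", "female", None, "male", None])

# mode() ignores missing values by default
most_frequent = sex.mode()[0]
filled = sex.fillna(most_frequent)

print(most_frequent)    # male
print(filled.tolist())  # ['male', 'female', 'male', 'male', 'male']
```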

__Fill with a standard string.__

In [None]:
df['Sex'] = df['Sex'].fillna("Empty")

In [None]:
df

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
5,118,119,0,1,"Baxter, Mr. Quigg Edmond",male,,0,1,PC 17558,247.5208,B58 B60,C
6,424,425,0,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,,,S
7,678,679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)",female,43.0,1,6,CA 2144,46.9,,S
8,269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,,C99,S
9,229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S


__Fill with the Mean or the Median__

In [None]:
# Mean value
mean_val = df['Fare'].mean()
df['Fare'].fillna(mean_val).head(10)

0     10.46250
1      7.87920
2      8.66250
3     10.50000
4     13.00000
5    247.52080
6     42.00979
7     46.90000
8     42.00979
9     25.46670
Name: Fare, dtype: float64

__Fill with the Mean or the Median by group__

In [None]:
# Fill missing Fare with the mean Fare of the passenger's Pclass.
# transform('mean') returns one value per row, aligned with df's index,
# so only the missing entries are replaced.
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('mean'))
df.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S
1,44,45,1,3,"Devaney, Miss. Margaret Delia",female,,0,0,330958,7.8792,,Q
2,821,822,1,3,"Lulic, Mr. Nikola",male,27.0,0,0,315098,8.6625,,S
3,458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
4,795,796,0,2,"Otter, Mr. Richard",male,39.0,0,0,28213,13.0,,S
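
Group-wise filling is easiest to verify on a tiny frame. The sketch below uses made-up data (`toy` is hypothetical) and relies on `transform('mean')`, which returns one value per row so the result lines up with the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical fares for two passenger classes, one gap in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [100.0, np.nan, 80.0, 8.0, 10.0, np.nan],
})

# Mean fare of each row's own class, repeated for every row in that class
group_mean = toy.groupby("Pclass")["Fare"].transform("mean")

# Replace only the missing fares with the class mean
toy["Fare_filled"] = toy["Fare"].fillna(group_mean)

print(toy["Fare_filled"].tolist())  # [100.0, 90.0, 80.0, 8.0, 10.0, 9.0]
```

Swapping `"mean"` for `"median"` gives the group-wise median fill in the same way.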


### Interpolation

Be careful with this technique: make sure you understand whether interpolation is a valid choice for your data. Several interpolation methods are available; the default is linear.

In [None]:
fare = {'first_class':100, 'second_class':np.nan, 'third_class':60, 'open_class':20}

In [None]:
ser = pd.Series(fare)

In [None]:
ser

first_class     100.0
second_class      NaN
third_class      60.0
open_class       20.0
dtype: float64

By default, `ser.interpolate()` will do a linear interpolation.

With the default linear method, Pandas ignores the index labels and treats the rows as equally spaced, so the row position (0, 1, 2, ...) acts as X and the values being interpolated act as Y. The result therefore depends on row order, so make sure the data is sorted meaningfully before interpolating.

![image.png](attachment:image.png)

Source: Wikipedia

In [None]:
ser.interpolate()

first_class     100.0
second_class     80.0
third_class      60.0
open_class       20.0
dtype: float64
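
When the index itself carries numeric meaning and the points are unevenly spaced, `method='index'` uses the index values as X instead of the row positions. A minimal sketch on a made-up series (`s` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical series with unevenly spaced numeric index: 0, 1, 10
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# Default linear: rows treated as equally spaced, gap gets the midpoint
by_position = s.interpolate()

# method='index': the gap is weighted by its actual distance along the index
by_index = s.interpolate(method="index")

print(by_position.tolist())  # [0.0, 5.0, 10.0]
print(by_index.tolist())     # [0.0, 1.0, 10.0]
```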

You could explore other methods as well, such as 'spline'.

In [None]:
ser.reset_index().interpolate(method='spline', order=2)

What if you have it as a DataFrame?

It works as well, but take care of the index.

In [None]:
df = pd.DataFrame(ser,columns=['Fare'])

In [None]:
df

Unnamed: 0,Fare
first_class,100.0
second_class,
third_class,60.0
open_class,20.0


In [None]:
df.interpolate()

Unnamed: 0,Fare
first_class,100.0
second_class,80.0
third_class,60.0
open_class,20.0


In [None]:
df = df.reset_index()

In [None]:
df

Unnamed: 0,index,Fare
0,first_class,100.0
1,second_class,
2,third_class,60.0
3,open_class,20.0


In [None]:
df.interpolate(method='spline',order=2)

Unnamed: 0,index,Fare
0,first_class,100.0
1,second_class,86.666667
2,third_class,60.0
3,open_class,20.0


## Challenge

Fill the missing `Fare` values in `Titanic_missing.csv` using linear interpolation and spline interpolation of order 2.

```python
import pandas as pd
df = pd.read_csv("Datasets/Titanic_missing.csv")
```

In [None]:
import pandas as pd
df_orig = pd.read_csv("Datasets/Titanic_orig.csv") 
df = pd.read_csv("Datasets/Titanic_missing.csv")

In [None]:
df[['Pclass', 'Fare']].to_csv("Datasets/Titanic_2cols.csv", index=False)

In [None]:
df.loc[:, ['Pclass', 'Fare']].head(20)

Unnamed: 0,Pclass,Fare
0,3,10.4625
1,3,7.8792
2,3,8.6625
3,2,10.5
4,2,13.0
5,1,247.5208
6,3,
7,3,46.9
8,1,
9,3,25.4667


In [None]:
# Linear
linear = df.loc[:, ['Pclass', 'Fare']].interpolate()
linear.head(10)

Unnamed: 0,Pclass,Fare
0,3,10.4625
1,3,7.8792
2,3,8.6625
3,2,10.5
4,2,13.0
5,1,247.5208
6,3,147.2104
7,3,46.9
8,1,36.18335
9,3,25.4667


__Compare against the actuals__

There is a significant deviation because the rows are not sorted by `Pclass`, so each gap is filled from neighbouring rows that may belong to a different class.

In [None]:
pd.concat([df_orig.loc[:, ['Pclass', 'Fare']], df.loc[:, ['Pclass', 'Fare']], linear], axis=1)

Unnamed: 0,Pclass,Fare,Pclass.1,Fare.1,Pclass.2,Fare.2
0,3,10.4625,3,10.4625,3,10.4625
1,3,7.8792,3,7.8792,3,7.8792
2,3,8.6625,3,8.6625,3,8.6625
3,2,10.5,2,10.5,2,10.5
4,2,13.0,2,13.0,2,13.0
5,1,247.5208,1,247.5208,1,247.5208
6,3,20.2125,3,,3,147.2104
7,3,46.9,3,46.9,3,46.9
8,1,135.6333,1,,1,36.18335
9,3,25.4667,3,25.4667,3,25.4667


__Let's sort data as per `Pclass` and try again__.

In [None]:
# Linear interpolation after sorting by Pclass
linear2 = df.loc[:, ['Pclass', 'Fare']].sort_values('Pclass').interpolate()
linear2.sort_index().head(10)  # back in the original row order for comparison

Unnamed: 0,Pclass,Fare
0,3,10.4625
1,3,7.8792
2,3,8.6625
3,2,10.5
4,2,13.0
5,1,247.5208
6,3,27.78125
7,3,46.9
8,1,73.5
9,3,25.4667


In [None]:
pd.concat([df_orig.loc[:, ['Pclass', 'Fare']], df.loc[:, ['Pclass', 'Fare']], linear, linear2], axis=1)

Unnamed: 0,Pclass,Fare,Pclass.1,Fare.1,Pclass.2,Fare.2,Pclass.3,Fare.3
0,3,10.4625,3,10.4625,3,10.4625,3,10.4625
1,3,7.8792,3,7.8792,3,7.8792,3,7.8792
2,3,8.6625,3,8.6625,3,8.6625,3,8.6625
3,2,10.5,2,10.5,2,10.5,2,10.5
4,2,13.0,2,13.0,2,13.0,2,13.0
5,1,247.5208,1,247.5208,1,247.5208,1,247.5208
6,3,20.2125,3,,3,147.2104,3,27.78125
7,3,46.9,3,46.9,3,46.9,3,46.9
8,1,135.6333,1,,1,36.18335,1,73.5
9,3,25.4667,3,25.4667,3,25.4667,3,25.4667
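
Rather than comparing the columns by eye, the deviation can be summarised as a single number. The sketch below computes the mean absolute error on the positions that were originally missing; `actual` and `with_gaps` are made-up series standing in for the original and the gapped `Fare` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical "true" values and a copy with two values removed
actual = pd.Series([10.0, 12.0, 14.0, 100.0, 110.0])
with_gaps = actual.copy()
with_gaps.iloc[[1, 3]] = np.nan

# Remember which positions were missing before filling
was_missing = with_gaps.isna()

# Fill the gaps with the default linear interpolation
filled = with_gaps.interpolate()

# Mean absolute error, measured only on the filled positions
mae = (filled[was_missing] - actual[was_missing]).abs().mean()

print(filled.tolist())  # [10.0, 12.0, 14.0, 62.0, 110.0]
print(mae)              # 19.0
```

A lower value for one filling strategy than another suggests it recovers the original data more closely, at least on this sample.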
