**MASTER IN BIG DATA AND BUSINESS ANALYTICS - UNED 2019-2020 - MALNUTRITION AND HEALTH SYSTEMS ACROSS THE WORLD**

NIDIA I. PAIVA VEGA

This is a digest of the information described at https://www.kaggle.com/danevans/world-bank-wdi-212-health-systems and https://www.kaggle.com/ruchi798/malnutrition-across-the-globe

Context

Malnutrition continues to make children much more vulnerable to disease and death.
There are 4 broad types of malnutrition: wasting, stunting, underweight and overweight.

Inspiration
- Was there a decline or rise in the number of severe wasting cases country-wise?
- Which countries bear the greatest share of severe wasting forms of malnutrition?
- How does the % of wasted children under 5 vary by country income classification, by health spending per capita, and by doctors, nurses and midwives, and specialist surgical staff per capita?


CONTENT
Notes, explanations, etc.
* There are countries/regions in the World Bank data not in the Covid-19 data, and countries/regions in the Covid-19 data with no World Bank data. This is unavoidable.
* There were political decisions made in both datasets that may cause problems. I chose to go forward with the data as presented, and did not attempt to modify the decisions made by the dataset creators.

Columns are as follows:

Country_Region: the region as used in Kaggle Covid-19 spread data challenges.

World_Bank_Name: the name of the country used by the World Bank

Health_exp_pct_GDP_2016: Level of current health expenditure expressed as a percentage of GDP. Estimates of current health expenditures include healthcare goods and services consumed during each year. This indicator does not include capital health expenditures such as buildings, machinery, IT and stocks of vaccines for emergency or outbreaks.

Health_exp_public_pct_2016: Share of current health expenditures funded from domestic public sources for health. Domestic public sources include domestic revenue as internal transfers and grants, transfers, subsidies to voluntary health insurance beneficiaries, non-profit institutions serving households (NPISH) or enterprise financing schemes as well as compulsory prepayment and social health insurance contributions. They do not include external resources spent by governments on health.

Health_exp_out_of_pocket_pct_2016: Share of out-of-pocket payments of total current health expenditures. Out-of-pocket payments are spending on health directly out-of-pocket by households.

Health_exp_per_capita_USD_2016: Current expenditures on health per capita in current US dollars. Estimates of current health expenditures include healthcare goods and services consumed during each year.

per_capita_exp_PPP_2016: Current expenditures on health per capita expressed in international dollars at purchasing power parity (PPP).

External_health_exp_pct_2016: Share of current health expenditures funded from external sources. External sources consist of direct foreign transfers and foreign transfers distributed by government, encompassing all financial inflows into the national health system from outside the country. External sources either flow through the government scheme or are channeled through non-governmental organizations or other schemes.

Physicians_per_1000_2009-18: Physicians include generalist and specialist medical practitioners.

Nurse_midwife_per_1000_2009-18: Nurses and midwives include professional nurses, professional midwives, auxiliary nurses, auxiliary midwives, enrolled nurses, enrolled midwives and other associated personnel, such as dental nurses and primary care nurses.

Specialist_surgical_per_1000_2008-18: Specialist surgical workforce is the number of specialist surgical, anaesthetic, and obstetric (SAO) providers who are working in each country per 100,000 population.

Completeness_of_birth_reg_2009-18: Completeness of birth registration is the percentage of children under age 5 whose births were registered at the time of the survey. The numerator of completeness of birth registration includes children whose birth certificate was seen by the interviewer or whose mother or caretaker says the birth has been registered.

Completeness_of_death_reg_2008-16: Completeness of death registration is the estimated percentage of deaths that are registered with their cause of death information in the vital registration system of a country.

Severe Wasting: % of children aged 0–59 months who are below minus three standard deviations from median weight-for-height

Wasting – Moderate and severe: % of children aged 0–59 months who are below minus two standard deviations from median weight-for-height

Overweight – Moderate and severe: % of children aged 0–59 months who are above two standard deviations from median weight-for-height

Stunting – Moderate and severe: % of children aged 0–59 months who are below minus two standard deviations from median height-for-age

Underweight – Moderate and severe: % of children aged 0–59 months who are below minus two standard deviations from median weight-for-age

Continent_id: the name of the continent used by the present research.
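
The weight-for-height cut-offs listed above can be illustrated with a small sketch. The function name `classify_weight_for_height` and the sample z-scores are hypothetical; only the thresholds (minus three, minus two and plus two standard deviations) come from the definitions. Note that the "Wasting" indicator counts moderate and severe cases together, so every severe case is also a wasting case.

```python
def classify_weight_for_height(z_score):
    """Classify a weight-for-height z-score with the cut-offs from the definitions."""
    if z_score < -3:
        return "severe wasting"      # also counted inside "wasting (moderate and severe)"
    if z_score < -2:
        return "moderate wasting"
    if z_score > 2:
        return "overweight"
    return "within range"


# Hypothetical z-scores, for illustration only
sample_z_scores = [-3.4, -2.5, 0.1, 2.6]
labels = [classify_weight_for_height(z) for z in sample_z_scores]
print(labels)
```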



In [None]:
# Load libraries.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


pd.set_option('display.max_columns', None)


In [None]:
# Import the data and display some rows
country_average = pd.read_csv("../input/malnutrition-across-the-globe/country-wise-average.csv")
display(country_average.head(10))

health_system = pd.read_csv("../input/world-bank-wdi-212-health-systems/2.12_Health_systems.csv")



In [None]:
country_average.count()

In [None]:
#Showing the NaNs
country_average.isna().sum()

In [None]:
# View the second dataset
health_system.count()


In [None]:
# Counting the NaNs
health_system.isna().sum()

In [None]:
# Removed one column containing redundant values
health_system_1=health_system.drop(['Province_State'],axis=1)

Dataset description

In [None]:
# Show summary statistics (distribution) for all numerical variables
health_system_1.describe()


In [None]:
# Draw the number of countries per income classification
pk_colors = ['#A8B820',  # Bug
             '#705848',  # Dark
             '#7038F8',  # Dragon
             '#F8D030',  # Electric
             ]

# Name the counts explicitly, because the default name given by value_counts() depends on the pandas version
helthSys_cnt = country_average["Income Classification"].value_counts(sort=False).sort_index().rename("Count")
helthSys_cnt = pd.concat([helthSys_cnt, pd.DataFrame(pk_colors,
                                           index=helthSys_cnt.index,
                                           columns=["Colors"])], axis=1)
helthSys_cnt.sort_values("Count", inplace=True)
helthSys_cnt_bar = helthSys_cnt.plot(kind='barh', y="Count", color=helthSys_cnt.Colors,
                                           legend=False, figsize=(8, 8))
helthSys_cnt_bar.set_title("Number of Countries\nby Income Classification",
                                           fontsize=16, weight="bold")

helthSys_cnt_bar.set_xlabel("Number of countries")


In [None]:
# We create the rellenar_NaN ("fill NaN") function to fill the Country_Region column with the value from the World_Bank_Name column.
# The Country_Region column contains NaN values.
def rellenar_NaN(row):
    if pd.isnull(row['Country_Region']):
        val = row['World_Bank_Name']
    else:
        val = row['Country_Region']
    return val


In [None]:
# We apply the rellenar_NaN function and assign the result to the Country_Region column.
health_system_1['Country_Region']=health_system_1.apply(rellenar_NaN, axis=1)

In [None]:
# Check that no NaN values remain in Country_Region
health_system_1['Country_Region'].isna().sum()
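
The row-wise function above can also be expressed with `Series.fillna`, which accepts another Series and fills each missing value from the row with the same index. This is a minimal sketch on a hypothetical toy frame (the country names are made up for illustration); it is an alternative, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two columns used above
toy = pd.DataFrame({
    "Country_Region": ["Spain", np.nan, "Peru"],
    "World_Bank_Name": ["Spain", "Yemen, Rep.", "Peru"],
})

# Missing Country_Region values are taken from World_Bank_Name, row by row
toy["Country_Region"] = toy["Country_Region"].fillna(toy["World_Bank_Name"])
print(toy)
```

The vectorized form avoids calling a Python function once per row, which matters little here but scales better on larger tables.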

In [None]:
# Keep only the countries that have a value in the Health_exp_per_capita_USD_2016 column.
health_system_1 = health_system_1[health_system_1['Health_exp_per_capita_USD_2016'].notnull()]

In [None]:
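# Split into numerical and categorical columns; copy() guards against pandas SettingWithCopyWarning when columns are added later
health_system_1_numericas=health_system_1.select_dtypes(include=['number']).copy()
health_system_1_categoricas=health_system_1.select_dtypes(include=['object']).copy()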

In [None]:
# View categorical variables
health_system_1_categoricas

We create an encoding of the country-name column

In [None]:
# Encoding of the column of *country names*
health_system_1_categoricas['Country_Region_num']=health_system_1_categoricas.Country_Region.astype('category').cat.codes


In [None]:
# Count the NaNs in the numerical variables
health_system_1_numericas.isna().sum()

In [None]:
# Rejoin the categorical and numerical columns to see how many *health system* values are missing
filtered_df= pd.merge(health_system_1_categoricas, health_system_1_numericas, left_index=True, right_index=True)

In [None]:
filtered_df.isna().sum()

Some rows are missing very important health system values, such as "total of birth...", "total of death", "out-of-pocket spending on health", "surgeons per thousand ..." and "doctors per thousand ...".
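
To judge how serious the gaps are, the raw NaN counts can be turned into shares per column. The sketch below uses a hypothetical toy frame; the two column names exist in this notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "Nurse_midwife_per_1000_2009-18": [1.2, np.nan, 3.4, np.nan],
    "Health_exp_pct_GDP_2016": [5.0, 6.1, 7.2, 8.3],
})

# isna() gives True/False per cell; the mean of booleans is the share of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```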

In [None]:
# Correlations. Scatter plot of per-capita health expenditure (PPP) against current health expenditure as % of GDP
my_plot = filtered_df.plot("per_capita_exp_PPP_2016", "Health_exp_pct_GDP_2016", kind="scatter")
plt.show()

In [None]:
# Correlations. Scatter plot of nurses and midwives per 1,000 people against current health expenditure as % of GDP
my_plot = filtered_df.plot("Nurse_midwife_per_1000_2009-18", "Health_exp_pct_GDP_2016", kind="scatter")
plt.show()

In [None]:
filtered_df_2=filtered_df[['Country_Region','Health_exp_pct_GDP_2016']]
filtered_df_2=filtered_df_2.head(5)

In [None]:
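# Bar plot of current health expenditure as % of GDP for the first five countries/regions
splot=filtered_df_2.plot(kind='bar',x='Country_Region',stacked=True,title="Country/region and current health expenditure as % of GDP")
splot.set_xlabel("Country_Region")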

In [None]:
# Convert the country names to capital letters so they match the *country average* file when crossing the datasets
filtered_df['Country_Region'] = filtered_df['Country_Region'].str.upper()

In [None]:
dataset_completeness=filtered_df.merge(country_average, left_on='Country_Region', right_on='Country')

In [None]:
dataset_completeness.isna().sum()
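
An inner merge silently drops countries whose names differ between the two files. One way to see which names fail to match is an outer merge with `indicator=True`, which adds a `_merge` column. The frames and values below are hypothetical; only the key column names (`Country_Region`, `Country`) come from this notebook.

```python
import pandas as pd

# Hypothetical rows, for illustration only
health = pd.DataFrame({
    "Country_Region": ["SPAIN", "PERU", "US"],
    "Health_exp_pct_GDP_2016": [9.0, 5.1, 17.1],
})
nutrition = pd.DataFrame({
    "Country": ["PERU", "UNITED STATES OF AMERICA"],
    "Severe Wasting": [0.4, 0.1],
})

# _merge is "both", "left_only" or "right_only" for every row of the outer join
check = health.merge(nutrition, left_on="Country_Region", right_on="Country",
                     how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country_Region", "Country", "_merge"]])
```

Rows flagged `left_only` or `right_only` are candidates for a manual name mapping before the final merge.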

In [None]:
# Preview the dataset without the Country column (the result is not assigned, so the column stays available for later cells)
dataset_completeness.drop(['Country'],axis=1)

In [None]:
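# Compute the correlation matrix
import seaborn as sns

# Select the numerical columns first, because recent pandas versions raise an error on text columns
Var_Corr = dataset_completeness.select_dtypes(include=['number']).corr()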


In [None]:
Var_Corr

In [None]:

sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns)


CONTINENT STUDY


In [None]:
dataset_completeness['Country_Region'] = dataset_completeness['Country_Region'].str.upper() 

**STUDY OF HEALTH AND NUTRITION SYSTEMS BY CONTINENT REGIONS**

**Logistic Regression model. Target (dummy): Severe Wasting**


In [None]:
# Rename the under-5 population column to *Population*
dataset_completeness_1 = dataset_completeness.rename(columns={'''U5 Population ('000s)''': 'Population'})

In [None]:
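# Estimated percentage of deaths from malnutrition: the Severe Wasting percentage scaled by a factor of 0.3
dataset_completeness_1['pctje_Muerte_por_malnutricion'] =dataset_completeness_1['Severe Wasting']*0.3

# Estimated number of deaths; Population is expressed in thousands, so the result is in thousands too
dataset_completeness_1['Num_Muerte_por_malnutricion'] =(dataset_completeness_1['pctje_Muerte_por_malnutricion']/100)*dataset_completeness_1['Population']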
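
The two derived columns can be checked by hand for a single country. The input numbers below are hypothetical; the factor 0.3 and the division by 100 are taken from the cell above. Because the population column is given in thousands (`U5 Population ('000s)`), the resulting count is also in thousands.

```python
# Hypothetical inputs, for illustration only
severe_wasting_pct = 4.0       # % of children under 5 with severe wasting
population_thousands = 2500    # under-5 population, in thousands

# Same arithmetic as the two derived columns
pct_deaths = severe_wasting_pct * 0.3
num_deaths_thousands = (pct_deaths / 100) * population_thousands

print("Estimated percentage: %.2f" % pct_deaths)
print("Estimated number (thousands): %.2f" % num_deaths_thousands)
```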

We define the criterion for our target variable

In [None]:
# We assume a 3% threshold: a country with Severe Wasting >= 3.0 is labeled as severely wasted in this target variable.
def label_race (row):
   if row['Severe Wasting'] >= 3.0 :
      return 1
   else:
      return 0

In [None]:
# We apply the function label_race to the dataset
dataset_completeness_1['Pais_malnutrido']=dataset_completeness_1.apply(label_race, axis=1)
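
The same 0/1 target can be built without a row-wise function, by comparing the whole column against the threshold and casting the boolean result to integers. This is a sketch on hypothetical values; the threshold 3.0 is the one used in `label_race`.

```python
import pandas as pd

# Hypothetical Severe Wasting percentages, for illustration only
toy = pd.DataFrame({"Severe Wasting": [0.5, 3.0, 7.2]})

# True/False per row, converted to 1/0
toy["Pais_malnutrido"] = (toy["Severe Wasting"] >= 3.0).astype(int)
print(toy)
```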

Evolution of the research: label encoders --> create a label for each country.

In [None]:
country_name=dataset_completeness_1['Country_Region']

In [None]:
dataset_completeness_reg_lineal=dataset_completeness_1

In [None]:
# Remove the columns which contain categorical variables, together with Severe Wasting, from which the target was derived
dataset_completeness_1=dataset_completeness_1.drop(['Severe Wasting','Country_Region','World_Bank_Name','Country'], axis=1)

In [None]:
dataset_completeness_1.isna().sum()

In [None]:
# Fill all NaN values with 0
dataset_completeness_1 = dataset_completeness_1.fillna(0)

In [None]:
#import libraries for Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

In [None]:
dataset_completeness_1.columns

In [None]:
# Use dtypes to check the variable types
dataset_completeness_1.dtypes

In [None]:
# Choose target and predictor variables
from sklearn.model_selection import train_test_split

y=dataset_completeness_1['Pais_malnutrido']

x=dataset_completeness_1.drop(['Pais_malnutrido'], axis=1)

X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# Showing y_test of target variable
y_test

In [None]:
lr = LogisticRegression()
lr.fit(X_train,y_train)

In [None]:
# Show the predictions of the logistic regression
predictions = lr.predict(X_test)
print(predictions)

In [None]:
predictions

In [None]:
lr.score(X_test,y_test)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))

In [None]:
# Show the Confusion Matrix
print(confusion_matrix(y_test, predictions))
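
Accuracy alone can look good when one class is much rarer than the other, so it helps to read precision and recall off the confusion matrix as well. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` for labels 0 and 1; the function name `rates_from_confusion` and the matrix values are hypothetical.

```python
def rates_from_confusion(matrix):
    """Return (accuracy, precision, recall) for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical matrix: almost every country is predicted as class 0
accuracy, precision, recall = rates_from_confusion([[34, 0], [3, 1]])
print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
```

In this made-up case the accuracy is above 0.9 even though only one of the four positive countries is found, which is why recall on the minority class deserves a separate look.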


Feature importance

In [None]:
lista_col=X_train.columns.tolist()

Print the coefficients together with the column names, and plot them.

In [None]:
from matplotlib import pyplot
# get importance
importance = lr.coef_[0]
# summarize feature importance

for i,v in enumerate(importance):
    
    print(lista_col[i]+', Score: '+str( "%.2f" % (v,)))

pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

Report of the first model:
- The prediction accuracy is too high.
- The malnutrition variables may be correlated with one another, so the target is conditioned by them.
- The variables that come from the health systems data entered the model. This is good for the explanations in this study.


In [None]:
#dataset_completeness=dataset_completeness.drop(['Overweight','pctje_Muerte_por_malnutricion', 'Num_Muerte_por_malnutricion'], axis=1)
# We choose only health-system variables this time
from sklearn.model_selection import train_test_split

y=dataset_completeness_1['Pais_malnutrido']

x=dataset_completeness_1.drop(['Pais_malnutrido','pctje_Muerte_por_malnutricion','Num_Muerte_por_malnutricion'
                           ,'Overweight','Stunting','Underweight','Wasting','Income Classification','Country_Region_num'], axis=1)

X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


lr = LogisticRegression()
lr.fit(X_train,y_train)

predictions = lr.predict(X_test)
print(predictions)

lr.score(X_test,y_test)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))

In [None]:
# Show the Confusion Matrix
print(confusion_matrix(y_test, predictions))


In [None]:
# Convert the X_train columns into a list so that the for loop can use it
lista_col=X_train.columns.tolist()

In [None]:
from matplotlib import pyplot
# get importance
importance = lr.coef_[0]
# summarize feature importance

for i,v in enumerate(importance):
    
    print(lista_col[i]+', Score: '+str( "%.2f" % (v,)))
    
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

In [None]:
# Join X_test, y_test (the target) and the predictions. Transform the prediction array into a dataframe; do the same with y_test.
# The merge function brings together the predictor variables and the target variable y_test.
# Then the index is reset so that the rows can be crossed with the predictions by position.
df = pd.DataFrame(data=predictions,  columns=["pred"])

y_test=y_test.to_frame()

prediction=pd.merge(X_test, y_test, left_index=True, right_index=True)

prediction.reset_index(inplace=True)

prediction_completeness=pd.merge(prediction, df, left_index=True, right_index=True)



In [None]:
prediction_completeness

In [None]:
prediction_completeness.to_excel('prediction_completeness.xlsx')

Conclusions of the second model

- The prediction accuracy is slightly lower than in the first model. This is an acceptable level of prediction for the joined dataset.
- The health-system fields run opposite to the malnutrition fields.

RANDOM FOREST.
In this case we use a categorical variable as the target

In [None]:
def label_race (row):
   if row['Pais_malnutrido'] == 1 :
      return 'Malnutrido'
   else:
      return 'Nutrido'

In [None]:
dataset_completeness_1['Pais_malnutrido']=dataset_completeness_1.apply (lambda row: label_race(row), axis=1)

In [None]:
y=dataset_completeness_1['Pais_malnutrido']

x=dataset_completeness_1.drop(['Pais_malnutrido','pctje_Muerte_por_malnutricion','Num_Muerte_por_malnutricion'], axis=1)

X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
X_test

In [None]:
# Load the Library
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [None]:
# Predict on X_test and see the score obtained
y_pred=rfc.predict(X_test)
df_pred = pd.DataFrame(y_pred)
rfc.score(X_test,y_test)


In [None]:
# Import the metrics functions to evaluate the predictions
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

In [None]:
# Compute the feature importances and sort them in descending order
importances = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]


In [None]:
# We create a list with the column names.
nom_columnas=X_test.columns.tolist()

In [None]:
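# Let's see the feature ranking
print("Feature ranking:")

for f in range(X_test.shape[1]):
    # indices[f] is the position of the f-th most important feature, so it must select the name too
    print(nom_columnas[indices[f]]+" (%f)" % (  importances[indices[f]]))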
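
A ranking only reads correctly if each column name stays paired with its own importance while sorting. A minimal sketch of that pairing, with hypothetical names and values, zips the two lists before ordering them:

```python
# Hypothetical names and importances, listed in the same column order
feature_names = ["col_a", "col_b", "col_c"]
importances_demo = [0.2, 0.5, 0.3]

# Pair each name with its own importance, then sort the pairs by importance
ranking = sorted(zip(feature_names, importances_demo),
                 key=lambda pair: pair[1], reverse=True)

print("Feature ranking:")
for name, score in ranking:
    print("%s (%f)" % (name, score))
```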


Let's repeat the model, removing the variable that looks like overfitting. The ranking reports Health_exp_pct_GDP_2016 (0.166318); note that the cell below drops Country_Region_num.

In [None]:
from sklearn.model_selection import train_test_split
# We create a parallel dataset to run the random forest
dataset_completo_rf=dataset_completeness_1.drop(['Country_Region_num'], axis=1)
dataset_completeness_1=dataset_completeness_1.drop(['Country_Region_num'], axis=1)

# 114 countries
# 102 nourished
# 12 malnourished
# We check how many malnourished and nourished countries we have
print(dataset_completo_rf['Pais_malnutrido'].value_counts())

# We balance the dataset to keep the same number of malnourished and nourished countries
dataset_completo_rf_1 = dataset_completo_rf.groupby('Pais_malnutrido')
dataset_completo_rf_1 = pd.DataFrame(dataset_completo_rf_1.apply(lambda x: x.sample(dataset_completo_rf_1.size().min()).reset_index(drop=True)))

# We check the changes made
print(dataset_completo_rf_1['Pais_malnutrido'].value_counts())
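
The balancing step above (keeping as many rows of the large class as there are in the small one) can also be sketched with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The toy frame and the `random_state` value are assumptions for illustration; fixing the seed makes the random undersampling repeatable.

```python
import pandas as pd

# Hypothetical imbalanced data: 6 rows of one class, 2 of the other
toy = pd.DataFrame({
    "Pais_malnutrido": ["Nutrido"] * 6 + ["Malnutrido"] * 2,
    "value": range(8),
})

# Size of the smallest class
n_min = toy["Pais_malnutrido"].value_counts().min()

# Draw n_min rows from every class
balanced = toy.groupby("Pais_malnutrido").sample(n=n_min, random_state=42)
print(balanced["Pais_malnutrido"].value_counts())
```

Undersampling discards most rows of the majority class, so with very few minority rows the resulting model is trained on a small sample and its scores can vary a lot between runs.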

In [None]:
# We use the new dataset to run the random forest
y=dataset_completo_rf_1['Pais_malnutrido']

x=dataset_completo_rf_1.drop(['Pais_malnutrido','pctje_Muerte_por_malnutricion','Num_Muerte_por_malnutricion'], axis=1)

X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.35, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
rfc = RandomForestClassifier()
predictor=rfc.fit(X_train, y_train)

y_pred=rfc.predict(X_test)
df_pred = pd.DataFrame(y_pred)

importances = rfc.feature_importances_

In [None]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

In [None]:
importances = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
nom_columnas=X_test.columns.tolist()

In [None]:
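# Print the feature ranking
print("Feature ranking:")

for f in range(X_test.shape[1]):
    # indices[f] is the position of the f-th most important feature, so it must select the name too
    print(nom_columnas[indices[f]]+" (%f)" % (  importances[indices[f]]))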

In [None]:
# Draw one decision tree to understand the Random Forest classification
nom_features=X_test.columns.tolist()
# class_names must list the classes in sorted order, which rfc.classes_ provides
nom_features_y=rfc.classes_.tolist()
estimator = rfc.estimators_[5]

In [None]:
from sklearn.tree import export_graphviz
import graphviz
# Export as dot file
graph = export_graphviz(estimator, 
                feature_names = nom_features,
                class_names = nom_features_y,
                rounded = True, proportion = False, 
                precision = 2, filled = True)
graphviz.Source(graph)

LINEAR REGRESSION with Severe Wasting as the target.


In [None]:
dataset_completeness_reg_lineal.columns

In [None]:
dataset_completeness_reg_lineal

In [None]:
# Exclude the categorical variables and the columns derived from the target
dataset_completeness_reg_lineal=dataset_completeness_reg_lineal.drop(['Country_Region','World_Bank_Name','Country','Country_Region_num','pctje_Muerte_por_malnutricion',
       'Num_Muerte_por_malnutricion', 'Pais_malnutrido'], axis=1)

In [None]:
# Load the necessary libraries for linear regression
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Plot the distribution of all variables
dataset_completeness_reg_lineal.hist()
plt.show()

In [None]:
# Create the linear regression model
regr = linear_model.LinearRegression()


In [None]:
dataset_completeness_reg_lineal=dataset_completeness_reg_lineal.fillna(0)

In [None]:
dataset_completeness_reg_lineal

In [None]:
# Take the target
y=dataset_completeness_reg_lineal['Severe Wasting']

# Remove the target from the predictor dataset x

x=dataset_completeness_reg_lineal.drop(['Severe Wasting'], axis=1)

# Split the dataset into training and test sets
X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.4, random_state=42)


# Training the model
regr.fit(X_train, y_train)
 
# Predict on the test set. With several predictors the fitted model is a hyperplane rather than a single 2D line
y_pred = regr.predict(X_test)
 
# Let's see the coefficients obtained: one slope per predictor
print('Coefficients: \n', regr.coef_)
# This is the value where the model cuts the Y axis (all predictors at 0)
print('Independent term: \n', regr.intercept_)
# Mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# R² (coefficient of determination). The best score is 1.0
print('Variance score: %.2f' % r2_score(y_test, y_pred))
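
The two scores printed above can be reproduced by hand, which makes their meaning easier to see. The helper name `mse_and_r2` and the numbers are hypothetical; the formulas are the standard ones: the mean of the squared errors, and one minus the ratio between the residual sum of squares and the total sum of squares.

```python
def mse_and_r2(y_true, y_hat):
    """Return (mean squared error, R squared) for two equally long sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return ss_res / n, 1 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only
mse, r2 = mse_and_r2([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print("Mean squared error: %.2f" % mse)
print("R squared: %.2f" % r2)
```

R squared is 1.0 for perfect predictions and becomes negative when the model does worse than always predicting the mean of the test targets.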

In [None]:
y_pred

Convert the prediction into a dataframe

In [None]:
# Let's repeat the same operation transforming the prediction into a data frame
df_pred = pd.DataFrame(data=y_pred,  columns=["pred"])

In [None]:
# How many elements do we have in the test (evaluation) set?
y_test.count()

In [None]:
y_test=y_test.to_frame()

In [None]:
# Join X_test and y_test together
prediction=pd.merge(X_test, y_test, left_index=True, right_index=True)

In [None]:
# Reset the index so the rows can be matched with the predictions
prediction.reset_index(inplace=True)

In [None]:
# Attach the predictions to the dataset
prediction_completeness=pd.merge(prediction, df_pred, left_index=True, right_index=True)

In [None]:
# Show prediction completeness
prediction_completeness

In [None]:
prediction_completeness.to_excel('dataset_reg_lineal_pred.xlsx')

In [None]:
X_test.columns

In [None]:
#Show the hyperparameters of the model
regr

In [None]:
# Show the coefficients of the model
regr.coef_

$$Y_{pred} = b + m_1 X_1 + m_2 X_2 + \dots + m_n X_n$$

where $b$ is the independent term. For example: $Y_{pred} = b + (-2.71 \times 10^{-1})(10.2) + (-2.64339594 \times 10^{-2})(5.1) + \dots$
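
The formula is a sum of products plus the independent term, which a few lines of code can evaluate. In this sketch the function name `predict_linear` and the intercept are hypothetical; the two coefficients are rounded from the example in the formula, and the feature values 10.2 and 5.1 are the ones shown there.

```python
def predict_linear(intercept, coefficients, features):
    """Evaluate intercept + m1*x1 + m2*x2 + ... for one observation."""
    return intercept + sum(m * x for m, x in zip(coefficients, features))


intercept = 5.0                    # hypothetical independent term
coefficients = [-0.271, -0.0264]   # rounded from the example coefficients
features = [10.2, 5.1]             # feature values from the example

y_hat = predict_linear(intercept, coefficients, features)
print("Prediction: %.5f" % y_hat)
```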

In [None]:
df_columns = pd.DataFrame(X_test.columns.tolist(),  columns =['columns'])

In [None]:
df_coef = pd.DataFrame(data=regr.coef_, columns=["coeficientes"])

In [None]:
df_coeficientes = pd.merge(df_columns, df_coef, left_index=True, right_index=True)

Save the coefficients dataframe to an Excel file

In [None]:
df_coeficientes.to_excel('coeficientes_reg_lineal.xlsx')

**Decision Tree Regression**

In [None]:
# Decision tree regressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Hyperparameter: maximum depth of the tree
regr_1 = DecisionTreeRegressor(max_depth=5)


In [None]:
# Separate the target (y) and the predictors (x), then split into training and test sets
y=dataset_completeness_reg_lineal['Severe Wasting']

x=dataset_completeness_reg_lineal.drop(['Severe Wasting'], axis=1)

X_train,X_test, y_train, y_test=train_test_split( x, y, test_size=0.4, random_state=42)


# Training the model
regr_1.fit(X_train, y_train)
 
# Predict on the test set
y_pred = regr_1.predict(X_test)
 
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# Variance Score. The best score is 1.0
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print('Score: %.2f' % regr_1.score(X_test, y_test))


In [None]:
regr_1.get_params()

In [None]:
# List the columns available for plotting
dataset_completeness_reg_lineal.columns

In [None]:
# Plot the decision tree regression against *current health expenditure as a percentage of GDP*.
x_draw=dataset_completeness_reg_lineal['Health_exp_pct_GDP_2016']

X_test_draw=X_test['Health_exp_pct_GDP_2016']
# Sort the test points by x so the prediction line is drawn from left to right
order=np.argsort(X_test_draw.values)
plt.figure()
plt.scatter(x_draw, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test_draw.values[order], y_pred[order], color="cornflowerblue",
         label="max_depth=5", linewidth=2)
plt.xlabel("Health_exp_pct_GDP_2016")
plt.ylabel("Severe Wasting")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

The data contain outlier values.
