# Question 1

For this question use the World Bank Data for Turkey for the following indicators. Use [wbgapi](https://pypi.org/project/wbgapi/) for getting the data.

* [Literacy rate, adult female (SE.ADT.LITR.FE.ZS)](https://data.worldbank.org/indicator/SE.ADT.LITR.FE.ZS)
* [Labor force, female (SL.TLF.TOTL.FE.ZS)](https://data.worldbank.org/indicator/SL.TLF.TOTL.FE.ZS)
* [Poverty headcount ratio at national poverty lines (SI.POV.NAHC)](https://data.worldbank.org/indicator/SI.POV.NAHC)
* [Current health expenditure per capita (SH.XPD.CHEX.PC.CD)](https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD)
* [GDP per capita (NY.GDP.PCAP.CD)](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
* [Mortality rate, under-5 (SH.DYN.MORT)](https://data.worldbank.org/indicator/SH.DYN.MORT)


Using the [statsmodels](https://www.statsmodels.org/stable/index.html) library write the best linear regression model using child mortality as the dependent variable while the rest are considered as independent variables. Pay particular attention to the fact that the order of the variables put into the model significantly impacts the performance of the model. Choose the best model by considering

* with the minimum number of variables and their interactions,
* with the optimal ordering of the independent variables and their interactions,
* $R^2$-score of the model,
* statistical significance of the model coefficients,
* ANOVA analysis of the model.


In [None]:
pip install wbgapi


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wbgapi
  Downloading wbgapi-1.0.12-py3-none-any.whl (36 kB)
Installing collected packages: wbgapi
Successfully installed wbgapi-1.0.12


In [None]:
import pandas as pd
import wbgapi as wb
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

In [None]:
def pull_data(indicator):
    #return pd.DataFrame(list(wb.data.fetch(indicator)))
    return wb.data.DataFrame(indicator, "TUR").T

indicators = { "literacy_rate" : 'SE.ADT.LITR.FE.ZS',
                "labor_force" : 'SL.TLF.TOTL.FE.ZS',
                "poverty_hc_ratio" : 'SI.POV.NAHC',
                "c_health_exp" : 'SH.XPD.CHEX.PC.CD',
                "gdp" : 'NY.GDP.PCAP.CD',
                "mortality_rate" : 'SH.DYN.MORT'              
                }
df = pull_data(indicators.values())

This function comes from a friend of mine. I had done the same thing by brute force, but I liked his function and copied it. I simply 

In [None]:
df = df.ffill().fillna(0)

In [None]:
df = df.rename(columns={
    "NY.GDP.PCAP.CD": "GDP",
    "SL.TLF.TOTL.FE.ZS": "labor_force",
    "SI.POV.NAHC": "pov_headcount",
    "SH.XPD.CHEX.PC.CD": "health_exp",
    "SH.DYN.MORT": "mortality_rate",
    "SE.ADT.LITR.FE.ZS": "literacy_rate",
})
df

series,GDP,literacy_rate,mortality_rate,health_exp,pov_headcount,labor_force
YR1960,509.005545,0.000000,257.0,0.000000,0.0,0.000000
YR1961,283.828284,0.000000,249.3,0.000000,0.0,0.000000
YR1962,309.446624,0.000000,241.4,0.000000,0.0,0.000000
YR1963,350.662985,0.000000,233.5,0.000000,0.0,0.000000
YR1964,369.583469,0.000000,225.7,0.000000,0.0,0.000000
...,...,...,...,...,...,...
YR2017,10589.667725,93.498268,11.4,442.617615,13.9,32.799757
YR2018,9454.348443,93.498268,10.7,389.865570,14.4,33.089766
YR2019,9121.515167,94.424042,10.1,396.466827,15.0,33.360649
YR2020,8536.433320,94.424042,9.5,396.466827,15.0,32.175606


I chose a 90% confidence level (that is, a 10% significance level) for my data.

In [None]:
from statsmodels.formula.api import ols

In [None]:
# pov_headcount was listed twice; one copy is enough.
X = df[['GDP', 'labor_force', 'health_exp', 'pov_headcount']]
Y = df['mortality_rate']
XX = sm.add_constant(X)
model = sm.OLS(Y, XX)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         mortality_rate   R-squared:                       0.831
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     69.84
Date:                Mon, 07 Nov 2022   Prob (F-statistic):           2.63e-21
Time:                        16:27:14   Log-Likelihood:                -301.66
No. Observations:                  62   AIC:                             613.3
Df Residuals:                      57   BIC:                             624.0
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const           191.5236      8.629     22.194



The R-squared value is less than 90%, so I try another model that includes the variables and their interactions.

*This* model is not exactly what we want. I can only say that from the R-squared value: it is not too low, but it is less than what we want (>90%).

In [None]:
model = ols('mortality_rate ~ GDP * literacy_rate * health_exp * pov_headcount * labor_force', data=df).fit()
print(model.summary())
sm.stats.anova_lm(model)

                            OLS Regression Results                            
Dep. Variable:         mortality_rate   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.933
Method:                 Least Squares   F-statistic:                     32.41
Date:                Mon, 07 Nov 2022   Prob (F-statistic):           2.77e-17
Time:                        16:30:26   Log-Likelihood:                -254.82
No. Observations:                  62   AIC:                             565.6
Df Residuals:                      34   BIC:                             625.2
Df Model:                          27                                         
Covariance Type:            nonrobust                                         
                                                             coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,231023.078656,231023.078656,582.473253,5.639295000000001e-23
literacy_rate,1.0,104256.413658,104256.413658,262.859333,1.444287e-17
GDP:literacy_rate,1.0,5187.607136,5187.607136,13.079396,0.0009569662
health_exp,1.0,144.233484,144.233484,0.363653,0.5504864
GDP:health_exp,1.0,4038.750712,4038.750712,10.182811,0.003045504
literacy_rate:health_exp,1.0,87.540457,87.540457,0.220714,0.6414959
GDP:literacy_rate:health_exp,1.0,745.552087,745.552087,1.879744,0.1793459
pov_headcount,1.0,124.675494,124.675494,0.314341,0.5787045
GDP:pov_headcount,1.0,599.159964,599.159964,1.510648,0.227479
literacy_rate:pov_headcount,1.0,6.971406,6.971406,0.017577,0.8953089


Since we chose a significance level of 10%, we look at the p-value columns (P>|t| in the summary, PR(>F) in the ANOVA table). Terms with a p-value below 0.10 are significant for our statistical model. Among the main effects, these are GDP and literacy rate.
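
The 10% rule can be applied mechanically. The sketch below retypes the PR(>F) values of the ten ANOVA rows printed above (the output is truncated, so the remaining terms are not covered) and keeps the terms below the threshold. The names `anova_rows` and `alpha` are only for illustration.

```python
import pandas as pd

# PR(>F) values copied from the ten ANOVA rows shown above (truncated output).
anova_rows = pd.DataFrame({
    "term": [
        "GDP", "literacy_rate", "GDP:literacy_rate", "health_exp",
        "GDP:health_exp", "literacy_rate:health_exp",
        "GDP:literacy_rate:health_exp", "pov_headcount",
        "GDP:pov_headcount", "literacy_rate:pov_headcount",
    ],
    "p_value": [
        5.639295e-23, 1.444287e-17, 0.0009569662, 0.5504864,
        0.003045504, 0.6414959, 0.1793459, 0.5787045,
        0.227479, 0.8953089,
    ],
})

alpha = 0.10  # significance level chosen in this notebook
significant = anova_rows.loc[anova_rows["p_value"] < alpha, "term"].tolist()
print(significant)
```

Among the rows shown, the `GDP:health_exp` interaction also passes the threshold, although `health_exp` on its own does not.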

In [None]:
model = ols('mortality_rate ~ GDP * literacy_rate', data=df).fit()
print(model.summary())
sm.stats.anova_lm(model)

                            OLS Regression Results                            
Dep. Variable:         mortality_rate   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.941
Method:                 Least Squares   F-statistic:                     327.1
Date:                Mon, 07 Nov 2022   Prob (F-statistic):           2.70e-36
Time:                        16:43:20   Log-Likelihood:                -267.23
No. Observations:                  62   AIC:                             542.5
Df Residuals:                      58   BIC:                             551.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           226.4692      6.08

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,231023.078656,231023.078656,665.834329,1.760914e-33
literacy_rate,1.0,104256.413658,104256.413658,300.478634,1.303975e-24
GDP:literacy_rate,1.0,5187.607136,5187.607136,14.951263,0.0002814121
Residual,58.0,20124.132969,346.96781,,


As we can see, with these two variables we get an R-squared value of 94.4%, which is above our 90% target. I think this is the best model we can choose to explain the child mortality rate.
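
To check that the smaller model is not just winning on raw R-squared, the adjusted R-squared can be recomputed from the numbers printed in the three summaries above. This is a small worked example, not part of the original analysis; the helper `adjusted_r2` and the dictionary `reported` are made-up names, and the inputs are the rounded values from the printed output.

```python
def adjusted_r2(r2, n_obs, n_terms):
    """Adjusted R^2 for a model with an intercept and n_terms regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# (R-squared, Df Model, AIC) as printed in the three summaries above.
reported = {
    "additive, 4 terms": (0.831, 4, 613.3),
    "full interactions": (0.963, 27, 565.6),
    "GDP * literacy_rate": (0.944, 3, 542.5),
}
n_obs = 62  # No. Observations in every summary

for name, (r2, k, aic) in reported.items():
    print(f"{name:22s} adj R2 = {adjusted_r2(r2, n_obs, k):.3f}  AIC = {aic}")

# Lower AIC is better when the models share the same response and data.
best_by_aic = min(reported, key=lambda name: reported[name][2])
print("lowest AIC:", best_by_aic)
```

The full interaction model has the highest raw R-squared, but once its 27 terms are penalized, the two-variable model has both the highest adjusted R-squared and the lowest AIC.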




# Question 2

For this question use Yahoo's Finance API for the following tickers:

* Gold futures (GC=F)
* Silver futures (SI=F)
* Copper futures (HG=F)
* Platinum futures (PL=F)

1. Write the best linear regression model that explains gold futures closing prices in terms of opening prices of gold, silver, copper, and platinum futures.
2. Repeat the same for silver, copper and platinum prices.
3. Compare the models you obtained in Steps 1 and 2. Which model is better? How do you decide? Explain.

In [None]:
from yahoo_fin.stock_info import get_data

I import the `get_data` function from `yahoo_fin` to pull data from the API. I call it with five arguments, which is very convenient.

In [None]:
def pull_data_yahoo(ticker):
  # Daily prices for the ticker; missing values are replaced with 0.
  data = get_data(ticker, start_date="09/07/1999", end_date="10/07/2022", index_as_date=True, interval='1d').fillna(0)
  return data

I also built a new function to pull data from the Yahoo Finance API. Unlike the first one, I wrote this one entirely myself.

In [None]:
gc_f = pull_data_yahoo("GC=F").ffill().dropna()
sil_f = pull_data_yahoo("SI=F").ffill().dropna()
hg_f = pull_data_yahoo("HG=F").ffill().dropna()
plat_f = pull_data_yahoo("PL=F").ffill().dropna()

I pulled the data and first tried the `dropna()` method to get rid of missing values. However, it did not work well, so I decided to use `fillna(0)` and `ffill()`. The latter, also known as forward fill, takes the last known value and repeats it until it sees a new value. I did this so that the missing data would not block my model.
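
The order of the two fills matters. In `pull_data_yahoo`, `fillna(0)` runs first, so the later `ffill()` and `dropna()` find nothing left to fill or drop. The sketch below shows the difference on a made-up three-day excerpt (the frame `prices` is hypothetical, not downloaded data).

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with a missing trading day in the middle.
days = pd.to_datetime(["2000-09-01", "2000-09-04", "2000-09-05"])
prices = pd.DataFrame(
    {"open": [277.0, np.nan, 275.8], "close": [277.0, np.nan, 275.8]},
    index=days,
)

# Order used in this notebook: zeros first, so ffill has nothing to do.
zero_then_ffill = prices.fillna(0).ffill().dropna()

# Alternative order: carry the last known price forward instead.
ffill_only = prices.ffill().dropna()

gap = pd.Timestamp("2000-09-04")
print(zero_then_ffill.loc[gap, "open"])  # 0.0
print(ffill_only.loc[gap, "open"])       # 277.0
```

This is consistent with the all-zero row for 2000-09-04 in the table that follows: a price of zero is treated by the regression as a real observation.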

In [None]:
new_df= pd.DataFrame()
new_df["gold_close"] = gc_f["close"]
new_df["gold_open"] =gc_f["open"]
new_df["silver_open"] =sil_f["open"]
new_df["silver_close"] =sil_f["close"]
new_df["copper_open"] =hg_f["open"]
new_df["copper_close"] =hg_f["close"]
new_df["platinum_open"] =plat_f["open"]
new_df["platinum_close"] =plat_f["close"]
new_df

Unnamed: 0,gold_close,gold_open,silver_open,silver_close,copper_open,copper_close,platinum_open,platinum_close
2000-08-30,273.899994,273.899994,4.950000,4.930000,0.8790,0.8850,593.900024,591.400024
2000-08-31,278.299988,274.799988,4.920000,5.003000,0.8850,0.8850,589.000000,586.700012
2000-09-01,277.000000,277.000000,5.035000,5.004000,0.8780,0.8890,588.000000,595.299988
2000-09-04,0.000000,0.000000,0.000000,0.000000,0.0000,0.0000,0.000000,0.000000
2000-09-05,275.799988,275.799988,4.990000,4.998000,0.8960,0.9060,602.000000,601.299988
...,...,...,...,...,...,...,...,...
2022-09-30,1662.400024,1661.699951,18.844999,18.959999,3.4440,3.4420,869.000000,870.000000
2022-10-03,1692.900024,1667.199951,20.170000,20.518999,3.4000,3.4580,895.000000,911.299988
2022-10-04,1721.099976,1701.199951,20.820000,21.037001,3.4640,3.5180,911.299988,943.700012
2022-10-05,1711.400024,1724.099976,20.795000,20.479000,3.5290,3.5325,946.400024,924.599976


I created a new DataFrame to put the needed values for the model.

In [None]:
model = ols('gold_close ~ gold_open * silver_open * copper_open * platinum_open', data=new_df).fit()
print(model.summary())
sm.stats.anova_lm(model)

                            OLS Regression Results                            
Dep. Variable:             gold_close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 8.532e+05
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:55:17   Log-Likelihood:                -21457.
No. Observations:                5631   AIC:                         4.295e+04
Df Residuals:                    5615   BIC:                         4.305e+04
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
gold_open,1.0,1533650000.0,1533650000.0,12798540.0,0.0
silver_open,1.0,0.06045732,0.06045732,0.0005045254,0.982081
gold_open:silver_open,1.0,390.6405,390.6405,3.259954,0.071045
copper_open,1.0,0.1257966,0.1257966,0.001049792,0.974154
gold_open:copper_open,1.0,1061.514,1061.514,8.85849,0.00293
silver_open:copper_open,1.0,160.88,160.88,1.342568,0.246631
gold_open:silver_open:copper_open,1.0,26.3056,26.3056,0.2195242,0.639421
platinum_open,1.0,0.8533687,0.8533687,0.00712149,0.93275
gold_open:platinum_open,1.0,25.47339,25.47339,0.2125792,0.644771
silver_open:platinum_open,1.0,15.06296,15.06296,0.1257027,0.722943


The OLS results are extremely accurate: the R-squared value is 1.000. What could be better than that? The P>|t| column also says that **gold_open** and the interactions of **gold_open and silver_open / gold_open, silver_open and copper_open / gold_open, silver_open and platinum_open / gold_open, silver_open, copper_open, platinum_open** explain the gold_close price well.

In [None]:
X = new_df[['silver_open','copper_open', 'gold_open','platinum_open']]
XX = sm.add_constant(X)
Y = new_df['gold_close']
model = sm.OLS(Y,XX)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:             gold_close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.191e+06
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:56:01   Log-Likelihood:                -21470.
No. Observations:                5631   AIC:                         4.295e+04
Df Residuals:                    5626   BIC:                         4.298e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0862      0.395      0.218



I also tried the model without interactions. It also gave a good fit, judging only by the R-squared value. The P>|t| column tells us that only **gold_open** explains the data well. The confidence intervals of the other variables contain zero, which means they are not good choices.

In [None]:
model = ols('silver_close ~ gold_open * silver_open * copper_open * platinum_open', data=new_df).fit()
print(model.summary())
sm.stats.anova_lm(model)

                            OLS Regression Results                            
Dep. Variable:           silver_close   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 3.031e+05
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        09:59:24   Log-Likelihood:                -1223.4
No. Observations:                5631   AIC:                             2479.
Df Residuals:                    5615   BIC:                             2585.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
gold_open,1.0,320629.779163,320629.779163,3536098.0,0.0
silver_open,1.0,91614.16548,91614.16548,1010376.0,0.0
gold_open:silver_open,1.0,0.441197,0.441197,4.865782,0.027435
copper_open,1.0,0.041453,0.041453,0.4571657,0.498979
gold_open:copper_open,1.0,0.480401,0.480401,5.29815,0.021385
silver_open:copper_open,1.0,0.533007,0.533007,5.878326,0.01536
gold_open:silver_open:copper_open,1.0,0.014202,0.014202,0.1566317,0.692292
platinum_open,1.0,0.348728,0.348728,3.845987,0.049915
gold_open:platinum_open,1.0,0.843397,0.843397,9.301485,0.0023
silver_open:platinum_open,1.0,0.174593,0.174593,1.925516,0.165305


In the OLS regression results it can be seen that **silver_open** and an interaction of **gold_open, silver_open and copper_open** give a good set of terms to explain the silver close price. 

In [None]:
X = new_df[['silver_open','copper_open', 'gold_open','platinum_open']]
XX = sm.add_constant(X)
Y = new_df['silver_close']
model = sm.OLS(Y,XX)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:           silver_close   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.130e+06
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        19:02:30   Log-Likelihood:                -1245.1
No. Observations:                5631   AIC:                             2500.
Df Residuals:                    5626   BIC:                             2533.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0036      0.011     -0.335



The p-values of silver_open and copper_open look good for the model, but copper_open's confidence interval contains zero, so only silver_open may explain silver_close in this model.

In [None]:
model = ols('copper_close ~ gold_open * silver_open * copper_open * platinum_open', data=new_df).fit()
print(model.summary())
sm.stats.anova_lm(model)

                            OLS Regression Results                            
Dep. Variable:           copper_close   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 3.182e+05
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        10:08:26   Log-Likelihood:                 10303.
No. Observations:                5631   AIC:                        -2.057e+04
Df Residuals:                    5615   BIC:                        -2.047e+04
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
gold_open,1.0,4662.980007,4662.980007,3083958.0,0.0
silver_open,1.0,595.297128,595.297128,393712.0,0.0
gold_open:silver_open,1.0,506.42692,506.42692,334935.8,0.0
copper_open,1.0,1452.475361,1452.475361,960624.4,0.0
gold_open:copper_open,1.0,1.2e-05,1.2e-05,0.007702267,0.930069
silver_open:copper_open,1.0,0.009536,0.009536,6.30655,0.012057
gold_open:silver_open:copper_open,1.0,1e-06,1e-06,0.0008616319,0.976584
platinum_open,1.0,0.009259,0.009259,6.1235,0.013369
gold_open:platinum_open,1.0,0.000854,0.000854,0.5647872,0.452369
silver_open:platinum_open,1.0,5.8e-05,5.8e-05,0.03804791,0.845354


In the ANOVA table, the PR(>F) column shows (our confidence level is still 90%) that the interaction of **gold_open, silver_open and copper_open** and the interaction of **gold_open and copper_open** have p-values above our 0.10 threshold. That means these interactions are not significant, so they can be dropped when explaining the **copper_close** price.

In [None]:
X = new_df[['silver_open','copper_open', 'gold_open','platinum_open']]
XX = sm.add_constant(X)
Y = new_df['copper_close']
model = sm.OLS(Y,XX)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:           copper_close   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.192e+06
Date:                Mon, 07 Nov 2022   Prob (F-statistic):               0.00
Time:                        19:02:31   Log-Likelihood:                 10294.
No. Observations:                5631   AIC:                        -2.058e+04
Df Residuals:                    5626   BIC:                        -2.054e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0011      0.001      0.768



As we can see, in all the models where we try to explain the closing price of one of the metals, the fit is very accurate: the opening prices explain the closing prices almost completely.
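
Since every R-squared above rounds to 0.999 or 1.000, it cannot separate the interaction model from the additive model. One option, sketched here from the printed log-likelihoods, is to compare AIC and BIC for each metal. The helpers `aic` and `bic` and the dictionary `fits` are made-up names; the parameter count is Df Model plus one for the intercept, and the inputs are rounded as printed, so the results are approximate.

```python
import math

def aic(llf, k_params):
    """Akaike information criterion from a log-likelihood."""
    return 2 * k_params - 2 * llf

def bic(llf, k_params, n_obs):
    """Bayesian information criterion from a log-likelihood."""
    return k_params * math.log(n_obs) - 2 * llf

n_obs = 5631  # No. Observations in every summary above

# (Log-Likelihood, Df Model + 1) as printed in the summaries above.
fits = {
    ("gold", "interactions"): (-21457.0, 16),
    ("gold", "additive"): (-21470.0, 5),
    ("silver", "interactions"): (-1223.4, 16),
    ("silver", "additive"): (-1245.1, 5),
    ("copper", "interactions"): (10303.0, 16),
    ("copper", "additive"): (10294.0, 5),
}

best_by_bic = {}
for metal in ("gold", "silver", "copper"):
    for kind in ("interactions", "additive"):
        llf, k = fits[(metal, kind)]
        print(metal, kind, round(aic(llf, k), 1), round(bic(llf, k, n_obs), 1))
    best_by_bic[metal] = min(
        ("interactions", "additive"),
        key=lambda kind: bic(*fits[(metal, kind)], n_obs),
    )
print(best_by_bic)
```

AIC and BIC are only comparable between models fitted to the same response, so this compares the two gold models with each other, not gold with silver. With these numbers BIC prefers the additive model for all three metals, because eleven extra interaction terms buy only a small gain in log-likelihood.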

# Question 3

1. Write a function that takes a ticker symbol and returns a pandas dataframe that for each day puts a 1 when the closing price is higher than the opening price, a 0 when the closing price is lower than the opening price.
2. Write the best logistic regression that predicts the time series you obtain from Step 1 for gold futures against the opening prices of gold, silver, copper, and platinum prices.
3. Repeat the same for silver, copper, and platinum prices.
4. Compare the models you obtained from Steps 2 and 3. Decide which is the best model, and explain your reasoning.
5. Does any of the models provide a good fit? Explain.
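
Step 2 asks for the up/down indicator of gold to be modelled against the four opening prices. A minimal sketch of that specification on synthetic data follows. The frame `demo`, the column `gold_up`, and the simulated prices are assumptions for illustration only and do not come from Yahoo Finance.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic opening prices, roughly on the scale of the four futures.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "gold_open": rng.normal(1700, 30, n),
    "silver_open": rng.normal(20, 1, n),
    "copper_open": rng.normal(3.5, 0.1, n),
    "platinum_open": rng.normal(900, 25, n),
})

# Simulated close: open plus noise, so the direction is unpredictable here.
gold_close = demo["gold_open"] + rng.normal(0, 10, n)
demo["gold_up"] = (gold_close > demo["gold_open"]).astype(int)

fit = logit(
    "gold_up ~ gold_open + silver_open + copper_open + platinum_open",
    data=demo,
).fit(disp=0)
print(fit.params)
print("pseudo R-squared:", fit.prsquared)
```

In this simulation the direction is pure noise, so a pseudo R-squared near zero is the expected outcome. The same formula could be fitted for silver, copper, and platinum by swapping the response column.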

In [85]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from zipfile import ZipFile
from io import BytesIO
from urllib.request import urlopen
from collections import Counter
from sklearn.metrics import confusion_matrix

from statsmodels.formula.api import logit
from statsmodels.api import Logit

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [86]:
def open_close(ticker):
  # Daily prices for the ticker; missing values are replaced with 0.
  df_open_close = get_data(ticker, start_date="09/07/2022", end_date="10/07/2022", index_as_date=True, interval='1d').fillna(0)
  # 1 when the close is above the open, 0 otherwise (including ties).
  # The vectorized comparison avoids chained assignment and its SettingWithCopyWarning.
  df_open_close["open_close"] = (df_open_close["close"] > df_open_close["open"]).astype(int)
  return df_open_close.dropna()

In [94]:
df_gold= open_close('GC=F') 
df_silv = open_close("SI=F") 
df_coop= open_close("HG=F") 
df_plat = open_close("PL=F") 


In [90]:
def logistic_reg(data):
  model = logit('open_close ~ high + low + open + volume', data=data).fit()
  return model.summary()

In [91]:
logistic_reg(df_gold)

Optimization terminated successfully.
         Current function value: 0.409103
         Iterations 8


0,1,2,3
Dep. Variable:,open_close,No. Observations:,22.0
Model:,Logit,Df Residuals:,17.0
Method:,MLE,Df Model:,4.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.4062
Time:,19:38:00,Log-Likelihood:,-9.0003
converged:,True,LL-Null:,-15.158
Covariance Type:,nonrobust,LLR p-value:,0.01515

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-19.0302,38.473,-0.495,0.621,-94.435,56.375
high,0.2255,0.125,1.803,0.071,-0.020,0.471
low,0.0982,0.099,0.997,0.319,-0.095,0.291
open,-0.3132,0.159,-1.969,0.049,-0.625,-0.001
volume,0.0003,0.000,0.720,0.472,-0.001,0.001


In [92]:
logistic_reg(df_coop)

Optimization terminated successfully.
         Current function value: 0.273219
         Iterations 9


0,1,2,3
Dep. Variable:,open_close,No. Observations:,22.0
Model:,Logit,Df Residuals:,17.0
Method:,MLE,Df Model:,4.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.6035
Time:,19:38:04,Log-Likelihood:,-6.0108
converged:,True,LL-Null:,-15.158
Covariance Type:,nonrobust,LLR p-value:,0.001081

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-8.5613,30.750,-0.278,0.781,-68.831,51.708
high,62.1077,39.298,1.580,0.114,-14.915,139.131
low,113.6448,68.879,1.650,0.099,-21.355,248.644
open,-173.8907,94.613,-1.838,0.066,-359.329,11.547
volume,0.0075,0.007,1.003,0.316,-0.007,0.022


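For copper, the low and open prices are significant at the 10% level (p = 0.099 and p = 0.066). High (p = 0.114) and volume (p = 0.316) are not, because their p-values are above 0.10.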

Because some rows have the same value in different columns among the variables I chose in my `logistic_reg` function, I decided to use other variables in the code below.

In [97]:
model = logit('open_close ~  open + volume', data=df_silv).fit()
model.summary()


Optimization terminated successfully.
         Current function value: 0.603329
         Iterations 6


0,1,2,3
Dep. Variable:,open_close,No. Observations:,22.0
Model:,Logit,Df Residuals:,19.0
Method:,MLE,Df Model:,2.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.1296
Time:,19:39:39,Log-Likelihood:,-13.273
converged:,True,LL-Null:,-15.249
Covariance Type:,nonrobust,LLR p-value:,0.1386

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,3.9923,11.525,0.346,0.729,-18.596,26.581
open,-0.2647,0.596,-0.444,0.657,-1.433,0.903
volume,0.0143,0.009,1.636,0.102,-0.003,0.031
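
The `Pseudo R-squ.` value in these summaries is McFadden's pseudo R-squared, which compares the model log-likelihood with that of an intercept-only model. A short worked example recomputes it from the `Log-Likelihood` and `LL-Null` values printed above (the helper name `mcfadden_pseudo_r2` is made up for illustration):

```python
def mcfadden_pseudo_r2(llf, llnull):
    """McFadden's pseudo R^2: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - llf / llnull

# (Log-Likelihood, LL-Null, reported Pseudo R-squ.) from the summaries above.
reported = {
    "gold": (-9.0003, -15.158, 0.4062),
    "copper": (-6.0108, -15.158, 0.6035),
    "silver": (-13.273, -15.249, 0.1296),
}

for metal, (llf, llnull, printed) in reported.items():
    print(metal, round(mcfadden_pseudo_r2(llf, llnull), 4), "reported:", printed)

best = max(reported, key=lambda m: mcfadden_pseudo_r2(*reported[m][:2]))
print("highest pseudo R-squared:", best)
```

By this measure the copper model fits best, but each model is fitted on only 22 observations and the silver model uses different predictors, so the comparison is indicative at most.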


# Question 4

For this question use the following [data](https://archive.ics.uci.edu/ml/datasets/credit+approval):


In [None]:
credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)

fn = {'+': 1, '-': 0}

X = credit.replace('?',0).iloc[:,[1,2,7,10,14]]
y = credit.iloc[:,15].map(lambda x: fn.get(x,0))

1. Split the data into training and test set.
2. Write different logistic regression models predicting y against X.
3. Construct [confusion matrices](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) on the test data set for these different models.
4. Analyze these models. Explain which model is the best model you have found.
5. Repeat Steps 1-4 several times. Does your best model stay as the best model? What should be the correct protocol to decide on the best model explaining the data?
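
Steps 1 to 5 can be sketched as a loop over several random splits. The example uses synthetic data rather than the credit file, and the names `X_demo`, `y_demo`, `evaluate`, and the three candidate column sets are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary problem: the label depends on columns 0 and 1 only.
rng = np.random.default_rng(0)
n = 400
X_demo = rng.normal(size=(n, 3))
logits = 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

def evaluate(columns, seed):
    """Fit on a random 75% split and return the test confusion matrix."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo[:, columns], y_demo,
        test_size=0.25, random_state=seed, stratify=y_demo,
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])

candidates = {"x0 only": [0], "x0 + x1": [0, 1], "all three": [0, 1, 2]}
results = {}
for name, cols in candidates.items():
    accuracies = []
    for seed in range(5):  # Step 5: repeat with different splits
        cm = evaluate(cols, seed)
        accuracies.append(np.trace(cm) / cm.sum())
    results[name] = (np.mean(accuracies), np.std(accuracies))
    print(name, "mean accuracy %.3f, std %.3f" % results[name])
```

Reporting the mean and spread over several splits, rather than a single confusion matrix, makes it visible when the ranking of two models changes from split to split. Cross-validation is a more systematic version of the same idea.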

In [None]:
X

Unnamed: 0,1,2,7,10,14
0,30.83,0.000,1.25,1,0
1,58.67,4.460,3.04,6,560
2,24.50,0.500,1.50,0,824
3,27.83,1.540,3.75,5,3
4,20.17,5.625,1.71,0,0
...,...,...,...,...,...
685,21.08,10.085,1.25,0,0
686,22.67,0.750,2.00,2,394
687,25.25,13.500,2.00,1,1
688,17.92,0.205,0.04,0,750


In [None]:
X["y"] = y

In [None]:
X = X.rename(columns={1: "one", 2: "two", 7: "seven", 10: "ten", 14: "fourteen"})
X

In [None]:
model = logit('y ~ one + two + seven + ten + fourteen', data=X).fit()
model.summary()

         Current function value: inf
         Iterations: 35


LinAlgError: ignored
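
One possible cause of this failure (an assumption, not verified against the file) is that `?` marks missing values, so any column containing it is read as text. Replacing `?` with `0` leaves the other entries as strings, and the formula interface then treats the column as categorical with one dummy variable per distinct value. A minimal sketch on a made-up three-value column shows how such a column could be converted to numbers. Filling the gap with the median is just one choice.

```python
import pandas as pd

# Made-up excerpt of a column that uses "?" for a missing value.
raw = pd.Series(["30.83", "?", "24.50"], name="one")
print(pd.api.types.is_numeric_dtype(raw))  # False: the column holds text

# Convert to numbers; entries that cannot be parsed become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the missing entry with the median of the observed values.
filled = numeric.fillna(numeric.median())
print(filled.tolist())
```

After a conversion like this, each column enters the model as a single numeric regressor.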