### Notation

I am going to use a different notation from the one I am used to, because it lets me express the ideas for an arbitrary number $n$ of weights. \
$x = (x_0, x_1, \ldots, x_n)$
where $x$ is the feature vector of one example with $n$ features, and $x_{0} = 1$ so that $\theta_{0}$ acts as the bias term \
$\theta = \begin{bmatrix}
    \theta_{0} \\
    \theta_{1} \\
    \ldots \\
    \theta_{n} \\ 
\end{bmatrix}$ where $\theta$ is the weight vector\
$h_{\theta}(x)$ represents the hypothesis, i.e. the predicted value given $x$, parameterized by the weights $\theta$ \
$\theta^{T}(x) = \theta^{T} \cdot x = (\theta_{0}, \theta_{1}, \ldots, \theta_{n}) \cdot (x_0, x_1, \ldots, x_n) = \sum_{i=0}^{n} \theta_{i}x_{i}$ represents the linear regression model. \
When I write a probability like $P(y|x;\theta)$, it means "the probability of $y$ given $x$, parameterized by $\theta$". \
$m$ will denote the size of the dataset. \
$\vec y$ means the vector of all $m$ outputs \
The superscript in $x^{(j)}$ denotes the $j$th example of $x$. \
$\alpha$ is the learning rate.
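
To make the $x_0 = 1$ convention concrete, here is a minimal sketch of the linear model as a single dot product. The feature values and weights are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical example values: n = 3 features and made-up weights.
features = np.array([2.0, -1.0, 0.5])
theta = np.array([0.1, 0.4, -0.3, 2.0])  # theta_0 is the bias weight

# Prepend x_0 = 1 so the bias is handled by the same dot product.
x = np.concatenate(([1.0], features))

linear_output = theta @ x  # sum_i theta_i * x_i
explicit_sum = sum(theta[i] * x[i] for i in range(len(x)))
```

Both `linear_output` and `explicit_sum` compute the same sum. The class later in this notebook keeps the bias in the last slot of its weight array rather than the first, which is an equivalent layout.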

### Logistic Regression Intuition

Logistic regression is an ML algorithm that deals with binary discrete outputs (so $y$ is either 0 or 1) and models the probability of each outcome. \
It relies on the assumption that each outcome is independent of the others (important to keep in mind when the probabilities are multiplied together later). \
The algorithm uses the sigmoid function to restrict the range of the output to $h_{\theta}(x) \in (0, 1)$.
$$g(z) = \frac{1}{\exp(-z) + 1}$$
$$h_{\theta}(x) = g(\theta^{T}(x))$$
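
As a quick numerical check of the squashing behaviour, here is a minimal sketch of $g$ in NumPy. The rewritten form below is my own choice, made so that `np.exp` never receives a large positive argument; it is algebraically the same function as the formula above.

```python
import numpy as np


def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)), written so that exp never overflows."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so it cannot overflow
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


# Extreme inputs saturate towards 0 and 1 in floating point; g(0) is 0.5.
values = sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
```
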
The probability of each value of $y$ then becomes either
$$P(y = 1|x;\theta) = h_{\theta}(x)$$
or
$$P(y = 0|x;\theta) = 1 - h_{\theta}(x)$$
Since we're only dealing with $y \in \{0, 1\}$, we can combine both cases into a single expression.
$$P(y|x;\theta) = (h_{\theta}(x))^{y}(1 - h_{\theta}(x))^{1-y}$$
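
A small check shows how the single expression reproduces both cases. The function name `bernoulli_probability` and the value `h = 0.8` are made up for illustration.

```python
def bernoulli_probability(y, h):
    """P(y | x; theta) in the compact form, where h = h_theta(x)."""
    return (h ** y) * ((1 - h) ** (1 - y))


h = 0.8  # hypothetical predicted probability that y = 1
p_y1 = bernoulli_probability(1, h)  # exponent on (1 - h) is 0, leaving h
p_y0 = bernoulli_probability(0, h)  # exponent on h is 0, leaving 1 - h
```
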
Generalizing it for $\vec y$ and using the fact that we're dealing with independent events, we get
$$P(\vec y|x;\theta) = \prod_{j=1}^{m}P(y^{(j)}|x^{(j)};\theta)$$
A product of many probabilities quickly underflows in floating point, and sums are easier to manipulate algebraically than products, so we'll use the log of this probability as the likelihood.
$$L(\theta) = \log{P(\vec y|x;\theta)} = \log{\left(\prod_{j=1}^{m}P(y^{(j)}|x^{(j)};\theta)\right)} = \log{\left(\prod_{j=1}^{m}(h_{\theta}(x^{(j)}))^{y^{(j)}}(1 - h_{\theta}(x^{(j)}))^{1-y^{(j)}}\right)}$$
$$L(\theta) = \sum_{j=1}^{m}\left(y^{(j)}\log{(h_{\theta}(x^{(j)}))}+ (1- y^{(j)})\log{(1 - h_{\theta}(x^{(j)}))}\right) $$
Instead of minimizing a cost function, we'll be maximizing the likelihood using gradient ascent
$$\theta_{i} := \theta_{i} + \alpha\frac{\partial}{\partial \theta_{i}}L(\theta)$$
which turns out to be very similar to the update rule for linear regression
$$\theta_{i} := \theta_{i} + \alpha\sum_{j=1}^{m}(y^{(j)} - h_{\theta}(x^{(j)}))x_{i}^{(j)}$$
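
The gradient in this update follows from $g'(z) = g(z)(1 - g(z))$, which cancels the $h_{\theta}(1 - h_{\theta})$ factors that the logs introduce and leaves $(y^{(j)} - h_{\theta}(x^{(j)}))x_{i}^{(j)}$ for each example. Below is a minimal vectorized sketch of one update, checked against a finite-difference approximation. The data is random and synthetic, and the function names, the seed, and the value of `alpha` are my own choices rather than part of the class used in this notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))


def gradient(theta, X, y):
    """dL/dtheta_i = sum_j (y_j - h_j) * x_ij, for all i at once."""
    return X.T @ (y - sigmoid(X @ theta))


rng = np.random.default_rng(0)  # fixed seed, synthetic data only
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # x_0 = 1
y = (rng.random(m) < 0.5).astype(float)
theta = rng.normal(scale=0.1, size=n + 1)

# Central finite differences as an independent check of the formula.
step = 1e-6
numeric = np.array([
    (log_likelihood(theta + step * e, X, y)
     - log_likelihood(theta - step * e, X, y)) / (2 * step)
    for e in np.eye(n + 1)
])
analytic = gradient(theta, X, y)

# One gradient ascent step with an assumed learning rate alpha.
alpha = 0.01
theta_new = theta + alpha * analytic
```

If the formula is right, `analytic` and `numeric` agree to several decimal places, and a small enough step increases the log-likelihood.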


In [2]:
import pandas as pd

# Load the test set and drop the columns that are not used as features
X_test = pd.read_csv("test.csv")

X_test = X_test.drop(columns=["Ticket", "Cabin", "PassengerId", "Name", "Sex"])
# One-hot encode the remaining categorical column (Embarked)
X_test = pd.get_dummies(X_test)
X_test = X_test.rename(
    columns={"Embarked_C": "C", "Embarked_Q": "Q", "Embarked_S": "S"}
)
X_test["C"] = X_test["C"].astype(int)
X_test["Q"] = X_test["Q"].astype(int)
X_test["S"] = X_test["S"].astype(int)
# Replace missing values (e.g. in Age) with the column mean
X_test = X_test.fillna(X_test.mean())
print(X_test)
X_test = X_test.to_numpy()

     Pclass       Age  SibSp  Parch      Fare  C  Q  S
0         3  34.50000      0      0    7.8292  0  1  0
1         3  47.00000      1      0    7.0000  0  0  1
2         2  62.00000      0      0    9.6875  0  1  0
3         3  27.00000      0      0    8.6625  0  0  1
4         3  22.00000      1      1   12.2875  0  0  1
..      ...       ...    ...    ...       ... .. .. ..
413       3  30.27259      0      0    8.0500  0  0  1
414       1  39.00000      0      0  108.9000  1  0  0
415       3  38.50000      0      0    7.2500  0  0  1
416       3  30.27259      0      0    8.0500  0  0  1
417       3  30.27259      1      1   22.3583  1  0  0

[418 rows x 8 columns]


In [3]:
# Same preprocessing as for the test set, with the label split off first
X = pd.read_csv("train.csv")
y = X["Survived"].to_numpy()
X = X.drop(columns=["Ticket", "Cabin", "PassengerId", "Name", "Sex", "Survived"])
# One-hot encode the remaining categorical column (Embarked)
X = pd.get_dummies(X)
X = X.rename(columns={"Embarked_C": "C", "Embarked_Q": "Q", "Embarked_S": "S"})
X["C"] = X["C"].astype(int)
X["Q"] = X["Q"].astype(int)
X["S"] = X["S"].astype(int)
# Replace missing values with the column mean
X = X.fillna(X.mean())
X = X.to_numpy()
X.shape

(891, 8)

In [4]:
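# Hold out part of the training data for validation. The held-out arrays get
# their own names so that X_test, loaded from test.csv above, is not overwritten.
X_train = X[:91]
X_val = X[91:]
y_train = y[:91]
y_val = y[91:]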

In [5]:
X_train = X
y_train = y

In [6]:
import numpy as np
import matplotlib.pyplot as plt


class LogisticRegression:
    def __init__(self, X, learning_rate=0.001):
        self.m = len(X)  # number of training examples
        self.n = X.shape[-1]  # number of features
        self.learning_rate = learning_rate  # alpha in the update rule
        # n feature weights, followed by the bias term in the last slot
        self.weights = np.zeros(self.n + 1)

    def sigmoid(self, z):
        # Clip z so that np.exp(-z) cannot overflow for large negative z
        z = np.clip(z, -500, 500)
        return 1 / (np.exp(-z) + 1)

    def linear_regression(self, x_j):
        weights_without_bias = self.weights[:-1]
        return np.dot(weights_without_bias, x_j) + self.weights[-1]

    def hypothesis(self, x_j):
        return self.sigmoid(self.linear_regression(x_j))

    def probability_of_y_given_x(self, y_j, x_j):
        hypothesis = self.hypothesis(x_j)
        return (hypothesis) ** (y_j) * (1 - hypothesis) ** (1 - y_j)

    def likelihood(self, X, y):
        # Log-likelihood L(theta), summed over the examples in X
        return np.sum(
            [
                y[j] * np.log(self.hypothesis(X[j]))
                + (1 - y[j]) * np.log(1 - self.hypothesis(X[j]))
                for j in range(len(X))
            ]
        )

    def predict(self, X):
        # Round the predicted probability to the nearest class label
        return [np.round(self.hypothesis(X[j])).astype(int) for j in range(len(X))]

    def gradient_ascent(self, X, errors, i):
        # Step for weight i: alpha * sum_j (y_j - h(x_j)) * x_ij
        return self.learning_rate * np.sum(errors * X[:, i])

    def fit(self, X, y):
        # Compute every residual with the current weights first, so that all
        # weights are updated simultaneously as the update rule requires
        errors = np.array([y[j] - self.hypothesis(X[j]) for j in range(len(X))])
        for i in range(len(self.weights) - 1):
            self.weights[i] += self.gradient_ascent(X, errors, i)
        # The bias behaves like a weight on a constant feature equal to 1
        self.weights[-1] += self.learning_rate * np.sum(errors)


L = 0.001  # learning rate
model = LogisticRegression(X_train, learning_rate=L)
epochs = 1000
for _ in range(epochs):
    model.fit(X_train, y_train)  # one full-batch gradient ascent step
y_pred = pd.DataFrame(
    model.predict(X_test), columns=["Survived"], index=np.arange(892, 892 + len(X_test))
)
y_pred.index.name = "PassengerId"
y_pred.to_csv("./submission.csv")

