## Xi Lin's Machine Learning Project Repo
- [Github Repo Link](https://github.com/xl3283/Machine-Learning-Project)

# 1. World Happiness Report Analysis

In this project, I analyzed the World Happiness Report dataset to predict happiness scores for countries based on various socio-economic factors. The dataset includes GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption.

## Data Source 
- [World Happiness Report](https://worldhappiness.report/)

## Project Repository

- [Jupyter Notebook](https://github.com/xl3283/Machine-Learning-Project/blob/main/Project%20List/U.N.%20World%20Happiness%20Data%20Project.ipynb)

## Data Visualization 
![image.png](attachment:image.png) 
As the graph shows, the average happiness score rises as GDP per capita increases. This suggests that GDP is an important factor in people's happiness.

## Data Preprocessing

The dataset was preprocessed using a combination of techniques, including:

1. Imputation: Replacing missing values in numeric features with their median and in categorical features with their mode.
2. Scaling: Standardizing numeric features using StandardScaler.
3. Encoding: One-hot encoding categorical features.
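
The three steps above could be wired together with scikit-learn roughly as follows. This is a minimal sketch rather than the notebook's actual code: the column names and the toy rows are made up for illustration, and the real dataset's columns may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's columns may differ.
numeric_features = ["gdp_per_capita", "social_support"]
categorical_features = ["region"]

# Numeric columns: median imputation, then standardization.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Tiny made-up frame with missing values, for illustration only.
toy = pd.DataFrame({
    "gdp_per_capita": [1.2, np.nan, 0.8, 1.5],
    "social_support": [0.9, 0.7, np.nan, 1.1],
    "region": ["Europe", "Asia", np.nan, "Asia"],
})

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Wrapping the steps in a ColumnTransformer keeps the imputation and scaling statistics tied to the training data, so the same fitted object can later be applied to test data with `preprocessor.transform`.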

## Predictive Models

The following models were used to predict happiness scores:

1. GradientBoostingClassifier (learning_rate=1.5, max_depth=5, n_estimators=50, random_state=4)
2. Support Vector Classifier (C=1, degree=0.001, gamma=0.1, kernel='rbf')
3. Neural Network (Keras, six layers)

## Model Evaluation

The models were evaluated using accuracy and F1 score. GridSearchCV was used to find the best hyperparameters for GradientBoostingClassifier and Support Vector Classifier. The best-performing model was the GradientBoostingClassifier, with an accuracy of 0.577.
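
A hyperparameter search of this kind might look like the sketch below. It runs on synthetic stand-in data, and both the parameter grid and the macro-averaged F1 are assumptions, since the grid and averaging mode used in the notebook are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real project used the World Happiness features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid only; the grid searched in the notebook is not given here.
param_grid = {
    "learning_rate": [0.1, 1.5],
    "max_depth": [1, 5],
    "n_estimators": [50],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=4),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best estimator on the held-out split.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
# Macro averaging is an assumption; it weights every class equally.
test_f1 = f1_score(y_test, y_pred, average="macro")

print(search.best_params_)
print(f"accuracy={test_accuracy:.3f}, macro F1={test_f1:.3f}")
```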

## Team Collaboration

Three more models were developed after learning from team members:

1. RandomForestClassifier (n_estimators=250, max_depth=10)
2. GradientBoostingClassifier (learning_rate=1.3, max_depth=1, random_state=0)
3. Support Vector Classifier (C=5, class_weight='balanced')

After learning from my teammates, the best-performing model among these was the GradientBoostingClassifier, with an accuracy of 0.5882, an improvement over my previous result of 0.577.
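
One way to compare the three configurations listed above side by side is cross-validated accuracy. The sketch below uses synthetic stand-in data and 3-fold cross-validation, both of which are assumptions. Only the hyperparameters named in the list come from the project, and everything else is left at scikit-learn defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data, not the World Happiness dataset.
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Hyperparameters as listed above; all other settings are library defaults.
models = {
    "random_forest": RandomForestClassifier(n_estimators=250, max_depth=10),
    "gradient_boosting": GradientBoostingClassifier(
        learning_rate=1.3, max_depth=1, random_state=0
    ),
    "svc": SVC(C=5, class_weight="balanced"),
}

# Mean 3-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```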

## Model Comparison

The models were compared with others on the competition leaderboard. My highest accuracy, 0.5882, was the second-highest score in the competition. My other models also scored well on the leaderboard, which suggests that the hyperparameters I chose were effective.


# 2. Covid X-ray Diagnostic 

In this project, I analyze a dataset of X-ray images to predict Covid-19 pneumonia. The dataset consists of images that demonstrate Covid Positivity and those that do not. The purpose of this project is to build a predictive model that can help in screening viral and Covid-19 pneumonia.

## Dataset Citation

M.E.H. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M.A. Kadir, Z.B. Mahbub, K.R. Islam, M.S. Khan, A. Iqbal, N. Al-Emadi, M.B.I. Reaz, “Can AI help in screening Viral and COVID-19 pneumonia?” arXiv preprint, 29 March 2020, https://arxiv.org/abs/2003.13145 

## Project Repository

- [Jupyter Notebook](http://localhost:8888/notebooks/Desktop/AI/Machine-Learning-Project/Project%20List/Covid%20Positive%20X-Ray%20Image%20Data%20Project.ipynb#)


## Practical Applications

A predictive model using this dataset can be practically useful for medical professionals, hospitals, and diagnostic centers. This model can help in rapid and accurate diagnosis of Covid-19 pneumonia, enabling timely treatment and better patient care.  

## Visualize Images
![image.png](attachment:image.png) 
The first image is from a COVID-19 patient, the second from a healthy person, and the third from a viral pneumonia patient. The three categories look clearly different from one another.

## Predictive Models

The following models were used to predict Covid-19 pneumonia from X-ray images:

1. Custom Convolutional Neural Network
2. Custom Convolutional Neural Network with More Layers
3. Custom Convolutional Neural Network with Additional Layers and Sigmoid Activation


## Model Performance and Hyperparameters

Based on accuracy, the second model performed best at 0.9163. The first model had the lowest accuracy at 0.33, and the third model fell in between at 0.8497. Across these three models, adding layers improved accuracy up to a point, but adding too many layers reduced it again. I also believe that increasing the number of epochs improves the result considerably, because models 2 and 3 have almost the same layers and differ only in the number of epochs.


## Team Collaboration

4. Custom Convolutional Neural Network with Adam optimizer, categorical crossentropy loss, and accuracy metric. 
5. Custom Convolutional Neural Network with two Dense layers, one with ReLU activation and the other with softmax activation for classifying into 3 categories.
6. Custom Convolutional Neural Network with BatchNormalization layers. 

My teammates added more layers, dropout, and BatchNormalization, all of which can improve accuracy. However, when I used their architectures, my accuracy was very low, around 30% to 40%. I therefore think that adding too many layers does not help accuracy, but that dropout and BatchNormalization can still improve it. If I apply those techniques to my first three models, I expect the final accuracy could improve further.
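
To make the ingredients above concrete, here is a small Keras sketch that combines BatchNormalization, dropout, a ReLU Dense layer, a three-way softmax output, the Adam optimizer, and categorical crossentropy. The input size, filter counts, and dropout rate are assumptions chosen to keep the example small. It is not a reproduction of any of the six models.

```python
import numpy as np
from tensorflow import keras

# Input size and layer widths are assumptions chosen to keep the sketch small.
model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),
    keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    # Dropout randomly zeroes half of the activations during training.
    keras.layers.Dropout(0.5),
    # Three output units: COVID, normal, viral pneumonia.
    keras.layers.Dense(3, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Random batch, only to check that the model emits one probability per class.
dummy_batch = np.random.rand(2, 64, 64, 3).astype("float32")
probabilities = model.predict(dummy_batch, verbose=0)
print(probabilities.shape)
```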

## Augmented Data

The data is augmented using the ImageDataGenerator with rescale normalization and a validation split of 0.2. The training and validation generators are created using a batch size of 64 and a seed of 7. The model is then fit using the fit_generator method with the training and validation generators, with steps per epoch and validation steps calculated based on the length of X_train and the respective split proportions.
![image-2.png](attachment:image-2.png) 
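
The step counts described above follow from simple arithmetic on the sample count, the split, and the batch size. The helper below is a hypothetical sketch: the function name and the choice to round up are assumptions, and the notebook may have used floor division instead.

```python
import math


def generator_steps(n_samples, validation_split=0.2, batch_size=64):
    """Return (steps_per_epoch, validation_steps) for a split generator."""
    n_validation = int(n_samples * validation_split)
    n_training = n_samples - n_validation
    # Round up so the final, possibly smaller, batch is still visited.
    steps_per_epoch = math.ceil(n_training / batch_size)
    validation_steps = math.ceil(n_validation / batch_size)
    return steps_per_epoch, validation_steps


# 1000 is a made-up sample count, not the size of the real X_train.
steps_per_epoch, validation_steps = generator_steps(1000)
print(steps_per_epoch, validation_steps)
```

With 1000 samples, 800 go to training and 200 to validation, which gives 13 training steps and 4 validation steps at a batch size of 64. In recent TensorFlow releases, `Model.fit` accepts generators directly and `fit_generator` is deprecated.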

## Conclusion
 
In this part, I used training and validation generators to augment the data, then fit the model and generated predictions. However, the final accuracy was still not very high and was similar to that of the last three models. My second model, a 16-layer CNN trained for 10 epochs, therefore remains the best performer at 91% accuracy, so I believe that using fewer layers and more epochs can help improve the final result.


# 3. Text Classification Using the Stanford SST Sentiment Dataset


## Dataset Description
The SST2 dataset is a collection of movie reviews and their corresponding sentiment labels. Building a predictive model using this data can be practically useful for several reasons. First, by training a model to accurately predict sentiment in movie reviews, we can improve our understanding of natural language and how to analyze it. Second, a well-performing model can be adapted to other domains and industries that require sentiment analysis, such as social media, customer reviews, or market research. 

Various stakeholders can benefit from a model like this, including:

1. Movie studios and distributors: A model that can accurately predict movie review sentiment can be useful for gauging audience reactions, understanding public opinion, and making data-driven decisions about marketing and distribution.
2. Social media platforms: These platforms can use sentiment analysis to understand user opinions on various topics, helping to inform content recommendations, advertising, and moderation efforts. 
3. Businesses: Companies can use the model to analyze customer feedback on their products or services, helping them to identify areas of improvement and capitalize on positive sentiment.


## Project Repository

- [Jupyter Notebook](https://github.com/xl3283/Machine-Learning-Project/blob/main/Project%20List/Stanford%20SST%20Sentiment%20Dataset%20Project.ipynb)
 

## Predictive Models

The following models were used to predict sentiment on the SST dataset:

1. Deep learning model (Keras) using an Embedding layer and LSTM layers
2. Deep learning model (Keras) using an Embedding layer and Conv1D layers
3. Transfer learning with GloVe embeddings
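
As an illustration of the first architecture in the list, a minimal Embedding plus LSTM classifier in Keras might look like the sketch below. The vocabulary size, sequence length, layer widths, and the single sigmoid output are all assumptions, so the notebook's actual model may differ.

```python
import numpy as np
from tensorflow import keras

# Vocabulary size, sequence length, and layer widths are illustrative assumptions.
vocab_size = 10000
max_len = 40

model = keras.Sequential([
    keras.layers.Input(shape=(max_len,)),
    keras.layers.Embedding(input_dim=vocab_size, output_dim=16),
    # return_sequences=True passes the full sequence to the next LSTM layer.
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.LSTM(32),
    # Single sigmoid unit for binary sentiment (positive vs. negative).
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random token ids stand in for tokenized, padded reviews.
dummy_tokens = np.random.randint(0, vocab_size, size=(4, max_len))
predictions = model.predict(dummy_tokens, verbose=0)
print(predictions.shape)
```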


## Model Performance and Hyperparameters

The basic LSTM and embedding layer model performed the best with an accuracy of 0.81. All models had similar hyperparameters in terms of epochs and batch_size. The differences in their architecture led to the observed differences in their performance. The LSTM model can capture longer-range dependencies in the input sequences, making it more suitable for sentiment analysis tasks where the context of the whole sentence matters.

The 1D convnets model, on the other hand, focuses more on detecting local patterns, which might not be as effective in capturing the overall sentiment. However, it still performed relatively well with an accuracy of 0.79.

The transfer learning model with GloVe embeddings had an accuracy of 0.80, which is also a good result. Pre-trained word embeddings, such as GloVe, can often improve the performance of models as they provide a more robust representation of words by leveraging the knowledge learned from large-scale text data. In this case, the GloVe-based model's performance was slightly lower than the LSTM model but higher than the 1D convnets model.
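
A common way to use pre-trained vectors is to copy them into an embedding matrix indexed by the tokenizer's vocabulary. The sketch below shows that step with NumPy only. The tiny `glove_vectors` dictionary and the `word_index` mapping are made-up stand-ins, and whether the notebook built its matrix this way is an assumption.

```python
import numpy as np

embedding_dim = 4  # real GloVe vectors are larger, e.g. 50 to 300 dimensions

# Made-up stand-ins for vectors parsed from a GloVe text file.
glove_vectors = {
    "good": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4], dtype="float32"),
}

# Hypothetical tokenizer vocabulary; index 0 is reserved for padding.
word_index = {"good": 1, "bad": 2, "movie": 3}

# Row i holds the vector for the word with index i; unknown words stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
for word, index in word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

print(embedding_matrix.shape)
```

A matrix like this can then initialize a Keras Embedding layer, typically with `trainable=False` so the pre-trained vectors are not overwritten during training.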

In conclusion, the basic LSTM and embedding layer model performed the best among the three models with the given hyperparameters. However, it is important to note that the differences in accuracy are relatively small, and further hyperparameter tuning or architecture adjustments might lead to better results for any of the models.


## Team Collaboration

4. Deep learning model (Keras) using an Embedding layer and Bidirectional LSTM
5. Deep learning model (Keras) using more Embedding layers, Conv1D layers, and GlobalMaxPooling1D
6. Deep learning model (Keras) using more Embedding layers and Bidirectional LSTM

After discussing with my team, I obtained the following accuracies for these new models:

- Model 4 (Bidirectional LSTM): accuracy = 0.77
- Model 5 (MaxPooling1D for 1D ConvNets): accuracy = 0.71
- Model 6 (Bidirectional LSTM and dropout technique): accuracy = 0.78

While these new models achieved good accuracies, they still underperformed my first three models. This could be due to several reasons, such as differences in the model architectures, hyperparameters, or the nature of the dataset.

## Conclusion
 
First set of models:

- Basic LSTM and embedding layer: accuracy = 0.81
- Basic embedding layer with 1D convnets: accuracy = 0.79
- Transfer learning with GloVe embeddings: accuracy = 0.80

Second set of models:

- Bidirectional LSTM: accuracy = 0.77
- MaxPooling1D for 1D ConvNets: accuracy = 0.71
- Bidirectional LSTM and dropout: accuracy = 0.78


From the results, it is clear that the first set of models performed better than the second set. Among all six models, the best performing one is the basic LSTM and embedding layer model with an accuracy of 0.81. This model used 10 epochs and a batch size of 32.

The LSTM model's performance can be attributed to its ability to capture longer-range dependencies in the input sequences, which is highly relevant for sentiment analysis tasks. The architecture allowed the model to learn complex patterns and maintain context throughout the input sequence.

The other models also performed well but couldn't surpass the LSTM model. The GloVe-based transfer learning model achieved an accuracy of 0.80, while the basic 1D ConvNets model reached 0.79. The second set of models, which includes Bidirectional LSTM and MaxPooling1D for 1D ConvNets, had lower accuracies ranging from 0.71 to 0.78.

To summarize, the best-performing model is the basic LSTM and embedding layer model, with an accuracy of 0.81. The relevant hyperparameters for this model are 10 epochs and a batch size of 32. To further improve performance, I could try hyperparameter tuning, architecture adjustments, cross-validation, and regularization techniques.