In this project, we will explore various data visualization and machine learning tasks using Python libraries like scikit-learn, Seaborn, and Matplotlib.
- Reimplement the Iris dataset clustering visualization from Chapter 15, but this time, perform dimensionality reduction using scikit-learn's TSNE estimator. Visualize the results and compare the clusters to the ones created in the clustering case study.
- Create a Seaborn pair plot for the California Housing dataset. Use Matplotlib's pan and zoom controls in the figure window to examine the data in more detail.
- The Iris dataset is labeled, making it suitable for supervised machine learning. Load the Iris dataset and perform classification using the k-nearest neighbors algorithm with a KNeighborsClassifier and the default k value. Report the prediction accuracy.
- Investigate the Diabetes dataset bundled with scikit-learn. Reimplement the steps of the multiple linear regression case study from Section 15.5. This dataset contains 442 samples, each with 10 features and a label indicating "disease progression one year after baseline."
- Research and load the Titanic Disaster dataset from the RDatasets repository. Use the DecisionTreeClassifier to build a decision tree for predicting whether a passenger survived or died. Output the decision tree in the DOT graphing language using export_graphviz. Visualize the decision tree using Graphviz.
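
As a starting point for the classification and decision tree tasks above, the following is a minimal sketch, not a full solution. It assumes scikit-learn is installed and uses the bundled Iris data for both steps, so the decision tree part is only a stand-in for the Titanic data, which must be downloaded separately. The split seed (`random_state=11`) and the variable names are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Load the labeled Iris data and hold out part of it for testing.
# random_state is an arbitrary seed chosen here for reproducibility.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=11)

# k-nearest neighbors with the default k (n_neighbors=5).
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)  # fraction of correct predictions
print(f"k-NN prediction accuracy: {knn_accuracy:.2%}")

# Decision tree on the same data (stand-in for the Titanic dataset).
tree = DecisionTreeClassifier(random_state=11)
tree.fit(X_train, y_train)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)
print(dot_source[:200])  # preview the beginning of the DOT text
```

To render the tree, one option is to write `dot_source` to a file such as `tree.dot` and run Graphviz on it, for example `dot -Tpng tree.dot -o tree.png`. For the Titanic task, the same calls would apply once the passenger data has been loaded and its categorical columns encoded as numbers.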